CN115836346A - In-memory computing device and data processing method thereof - Google Patents


Info

Publication number
CN115836346A
Authority
CN
China
Prior art keywords
memory
data elements
computing
computing device
memory array
Prior art date
Legal status
Pending
Application number
CN202080102722.1A
Other languages
Chinese (zh)
Inventor
张雅文
关天婵
范小鑫
王雨豪
郑宏忠
李双辰
柳春笙
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Publication of CN115836346A

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C 7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C 7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/54 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/21 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C 11/34 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C 11/40 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C 11/401 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Computer Hardware Design (AREA)
  • Memory System (AREA)

Abstract

The present disclosure relates to an in-memory computing device. The in-memory computing device includes a memory array configured to store data and computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of: selecting between a plurality of computing modes, including a first sorting mode and a second sorting mode, based on a configuration from a host communicatively coupled with the in-memory computing device; accessing data elements in a memory array of the in-memory computing device; and outputting the top K data elements among the data elements to the memory array or the host in the first or second sorting mode. K is an integer greater than a threshold if the first sorting mode is selected, and an integer less than or equal to the threshold if the second sorting mode is selected.

Description

In-memory computing device and data processing method thereof
Background
Similarity search has been widely applied in various computing fields, including multimedia databases, data mining, machine learning, and the like. The top-k function may be applied in a similarity search task to find the K most similar or K least similar elements among given elements (e.g., N elements). For example, the top-k function is used in a region-based convolutional neural network (RCNN) or the like. Traditionally, the top-k function is implemented in software.
However, conventional software implementations of the top-k function cannot handle large numbers of elements in a reasonable time frame and are therefore not suitable for applications where latency is critical. With the rapid increase in database size, the large amount of data transfer between processing units and storage devices becomes a performance bottleneck for the top-k function due to limited storage performance.
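As a point of reference, a minimal software sketch of the conventional approach follows (illustrative only; the helper name and sample data are not from the patent). Every element must travel from storage to the processor before it can be compared, which is exactly the data movement the disclosure seeks to avoid:

```python
# Conventional host-side top-k: all N elements are read out of storage
# and compared on the CPU. Fine for small N; a bottleneck for large N.
import heapq

def top_k_smallest(elements, k):
    # heapq.nsmallest keeps a bounded heap while scanning the input once
    return heapq.nsmallest(k, elements)

print(top_k_smallest([9, 2, 7, 4, 1, 8], 3))  # [1, 2, 4]
```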
Disclosure of Invention
An embodiment of the present disclosure provides an in-memory computing (PIM) device. The in-memory computing device includes a memory array and computing circuitry. The memory array is configured to store data. The computing circuitry is configured to execute a set of instructions to cause the in-memory computing device to perform the steps of: selecting, based on a configuration from a host communicatively coupled with the in-memory computing device, between a plurality of computing modes including a first sorting mode and a second sorting mode; accessing data elements in the memory array of the in-memory computing device; and outputting the top K data elements among the data elements to the memory array or the host in the first sorting mode or the second sorting mode. K is an integer greater than a threshold if the first sorting mode is selected, and an integer less than or equal to the threshold if the second sorting mode is selected.
An embodiment of the present disclosure also provides a data processing method. The data processing method includes the following steps: selecting between a plurality of computing modes based on a configuration of the in-memory computing device, the plurality of computing modes including a first sorting mode and a second sorting mode; accessing a plurality of data elements in a memory array of the in-memory computing device; and outputting, in the first sorting mode or the second sorting mode, the top K data elements of the plurality of data elements to the memory array or to a host communicatively coupled to the in-memory computing device, wherein K is an integer greater than a threshold if the first sorting mode is selected and an integer less than or equal to the threshold if the second sorting mode is selected.
An embodiment of the present disclosure also provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores a set of instructions executable by one or more computing circuits of an apparatus to cause the apparatus to perform a data processing method. The data processing method includes the following steps: selecting between a plurality of computing modes based on a configuration, the plurality of computing modes including a first sorting mode and a second sorting mode; accessing a plurality of data elements in a memory array of the apparatus; and outputting the top K data elements of the plurality of data elements to the memory array or to a host communicatively coupled to the apparatus in the first sorting mode or the second sorting mode. K is an integer greater than a threshold if the first sorting mode is selected, and an integer less than or equal to the threshold if the second sorting mode is selected.
An embodiment of the present disclosure also provides a data processing system. The data processing system includes a host and a plurality of in-memory computing devices communicatively coupled to the host. Each of the plurality of in-memory computing devices includes a memory array configured to store data and computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of: selecting between a plurality of computing modes based on a configuration from the host, the plurality of computing modes including a first sorting mode and a second sorting mode; accessing a plurality of data elements in the memory array of the in-memory computing device; and outputting the top K data elements of the plurality of data elements to the host in the first sorting mode or the second sorting mode. K is an integer greater than a threshold if the first sorting mode is selected, and an integer less than or equal to the threshold if the second sorting mode is selected.
Additional features and advantages of the disclosed embodiments will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the embodiments. The features and advantages of the disclosed embodiments may be realized and obtained by means of the elements and combinations set forth in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the embodiments disclosed, as claimed.
Drawings
FIG. 1 illustrates the structure of an exemplary in-memory computing block consistent with certain embodiments of the present disclosure;
fig. 2A illustrates an exemplary neural network accelerator architecture consistent with some embodiments of the present disclosure;
fig. 2B illustrates a schematic diagram of an exemplary cloud system including a neural network accelerator architecture, consistent with some embodiments of the present disclosure;
FIG. 3 illustrates the structure of an exemplary memory slice consistent with certain embodiments of the present disclosure;
FIG. 4 illustrates an exemplary in-memory computing processing unit consistent with certain embodiments of the present disclosure;
FIG. 5 illustrates exemplary operations performed by the in-memory compute processing unit of FIG. 4 for a top-k ordering method consistent with certain embodiments of the present disclosure;
FIG. 6 illustrates exemplary operations performed by the in-memory computing processing unit of FIG. 4 for another top-k ordering method, consistent with certain embodiments of the present disclosure;
FIGS. 7A and 7B illustrate exemplary in-memory computing processing units consistent with some embodiments of the present disclosure;
FIG. 8 illustrates an exemplary memory computing based accelerator architecture consistent with certain embodiments of the present disclosure;
FIG. 9 illustrates exemplary operations performed by the in-memory computing processing unit of FIGS. 7A and 7B for similarity search, consistent with certain embodiments of the present disclosure;
FIG. 10 illustrates exemplary operations performed by the in-memory computing processing unit of FIGS. 7A and 7B for similarity search, consistent with some embodiments of the present disclosure;
FIGS. 11 and 12 illustrate exemplary operations performed by the in-memory computing processing unit of FIGS. 7A and 7B for k-means clustering computations, consistent with certain embodiments of the present disclosure;
FIG. 13 illustrates a flow chart of an exemplary method of performing data processing consistent with certain embodiments of the present disclosure;
FIG. 14 illustrates a flow chart of an exemplary method of performing data processing consistent with some embodiments of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which like numerals refer to the same or similar elements throughout the different views unless otherwise specified. The implementations set forth below in the description of the exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatuses, systems, and methods consistent with aspects related to the present disclosure as set forth in the claims that follow. To the extent of any conflict with a term or definition incorporated by reference, the term or definition provided herein shall govern.
Unless expressly stated otherwise or infeasible, the term "or" encompasses all possible combinations. For example, if it is stated that a component includes A or B, the component can include A, or B, or both A and B, unless expressly stated otherwise or infeasible. As a second example, if it is stated that a component includes A, B, or C, then the component can include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C, unless expressly stated otherwise or infeasible. The term "exemplary" is used in the sense of "serving as an example" rather than "optimal".
Today, the size of databases and data processing tasks is growing dramatically and rapidly in a variety of applications. Furthermore, to provide a satisfactory user experience, many applications must meet stringent latency requirements. At present, many simple computations with high parallelism, such as similarity search and k-means computation, are limited by the bandwidth and capacity of the memory components in the system, which has become one of the main performance bottlenecks.
Embodiments of the present disclosure alleviate the above-described problems by providing an apparatus and method for data processing that perform top-k sorting, k-means clustering, or other similarity search computations. Through in-memory computing techniques and the high bandwidth of dynamic random access memory (DRAM), unnecessary data movement is reduced and efficient parallel computation is achieved. Therefore, memory performance bottlenecks in similarity search and k-means computation can be greatly reduced. With the disclosed apparatus and method, computation time can be kept within an acceptable range despite growing data volumes, and the overall performance and efficiency of the various computations are improved. The proposed apparatus and method for data processing can be applied to various applications having large databases and heavy data processing workloads, including various cloud systems using Artificial Intelligence (AI) computation.
In particular, embodiments disclosed herein may be used in a variety of applications or environments, such as artificial intelligence training and inference, database and big data analytics acceleration, and so on. Artificial intelligence related applications involve neural network based Machine Learning (ML) or Deep Learning (DL). For example, some embodiments may be used in neural network architectures, such as Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and so on. In addition, some embodiments are configured to support various processing architectures, such as a Data Processing Unit (DPU), a Neural Network Processing Unit (NPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Tensor Processing Unit (TPU), an Application-Specific Integrated Circuit (ASIC), any other type of heterogeneous accelerator processing unit, and so on.
The term "accelerator" as used herein refers to hardware used to accelerate certain computations. For example, the accelerator is configured to accelerate top-k ranking calculations, k-means clustering calculations, or other calculations performed in the similarity search. In some embodiments, the accelerator may be configured to accelerate workloads (e.g., neural network computing tasks) in any artificial intelligence related application. Accelerators with Dynamic Random-Access Memory (DRAM) or Embedded DRAM (eDRAM) are referred to as DRAM-based or Embedded DRAM-based accelerators.
FIG. 1 illustrates the structure of an exemplary in-memory computing block consistent with certain embodiments of the present disclosure. The in-memory computing block 100 includes a memory cell array 110, a block controller 120, a block row driver 131, and a block column driver 132. Although some embodiments are described using dynamic random access memory as an example, it should be understood that the in-memory computing block 100 according to embodiments of the present disclosure may be implemented based on various memory technologies, including Static Random Access Memory (SRAM), resistive random access memory (ReRAM), and the like. The memory cell array 110 includes m rows r1 to rm and n columns c1 to cn. As shown in FIG. 1, a memory cell 111 is connected between each of the m rows r1 to rm and each of the n columns c1 to cn. In some embodiments, data may be stored in the crossbar memory as multi-bit memristors.
The block row driver 131 and the block column driver 132 may provide signals, such as voltage signals, to the m rows r1 to rm and the n columns c1 to cn to process the corresponding operations. In some embodiments, the block row driver 131 and the block column driver 132 are configured to pass analog signals through the memory cells 111. In some embodiments, the analog signals are converted from digital input data.
The block controller 120 may include an instruction register for storing instructions. In some embodiments, the instructions indicate when the block row driver 131 or the block column driver 132 should provide signals to the corresponding rows or columns, which signals to provide, and so on. The block controller 120 may decode the instructions stored in the register into signals used by the block row driver 131 or the block column driver 132.
The in-memory computing block 100 may further include a row sense amplifier 141 or a column sense amplifier 142 for reading data from, or storing data into, the memory cells. In some embodiments, the row sense amplifier 141 and the column sense amplifier 142 store buffered data. In some embodiments, the in-memory computing block 100 further includes a digital-to-analog converter (DAC) 151 or an analog-to-digital converter (ADC) 152 to convert input signals or output data between the analog domain and the digital domain. In some embodiments of the present disclosure, the row sense amplifier 141 or the column sense amplifier 142 is omitted because the calculations in the in-memory computing block 100 can be performed directly on the values stored in the memory cells, without reading the values out and without using any sense amplifier.
According to an embodiment of the present disclosure, the in-memory computation block 100 implements parallel computation by using the memory as a plurality of Single Instruction Multiple Data (SIMD) processing units. The in-memory computation block 100 may support computation operations including bitwise operations, additions, subtractions, multiplications, and divisions for integer and floating-point values. For example, in the memory cell array 110 of FIG. 1, a first column c1 and a second column c2 store a first vector A and a second vector B, respectively. By applying formatted signals to the first to third columns c1 to c3 and to the rows corresponding to the lengths of vectors A, B, and C, the result C of adding vectors A and B can be stored in the third column c3. Similarly, the memory cell array 110 of FIG. 1 can also support vector multiply-and-add operations. For example, the computation C = aA + bB can be performed by applying a voltage signal corresponding to the multiplier a to the first column c1, applying a voltage signal corresponding to the multiplier b to the second column c2, performing the addition by applying formatted signals to the corresponding columns and rows, and saving the result C in the third column c3.
In some embodiments, a vector whose elements have n-bit values is stored across multiple columns. For example, a vector in which the elements have 2-bit values is stored in two columns of memory cells. In some embodiments, a vector may be stored across multiple memory blocks when the length of the vector exceeds the number of rows of the memory cell array 110 that makes up a memory block. The multiple memory blocks may be configured to compute different vector segments in parallel. Although the in-memory computing architecture in these embodiments performs computing operations without arithmetic logic other than the memory cells, the present disclosure may also be applied to in-memory computing architectures that include arithmetic logic for performing computations. As indicated above, addition, multiplication, and the like may also be performed as column-by-column vector computations in an in-memory computing architecture. The disclosed embodiments provide an in-memory computing accelerator architecture that can implement efficient top-k operations, k-means clustering, or similarity search in large databases. The top-k operation, i.e., finding the k largest or smallest elements of a set, is widely used for predictive modeling in information retrieval, machine learning, and data mining.
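As a rough digital model of such column-wise computation (a sketch only; the array shape and multipliers are arbitrary assumptions, and NumPy arithmetic stands in for the analog signals applied to the columns):

```python
import numpy as np

# Model of the memory cell array 110: 8 rows, 3 columns (c1, c2, c3).
cells = np.zeros((8, 3))
cells[:, 0] = np.arange(8.0)          # vector A stored in column c1
cells[:, 1] = np.arange(8.0)[::-1]    # vector B stored in column c2

a, b = 2.0, 3.0                       # multipliers, applied as voltage signals
# One column-wise operation touches every row in parallel (SIMD-like).
cells[:, 2] = a * cells[:, 0] + b * cells[:, 1]   # result C = aA + bB in c3
```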
Fig. 2A illustrates an exemplary accelerator architecture 200 consistent with some embodiments of the present disclosure. In some embodiments, the accelerator architecture 200 is referred to as a neural network processing unit architecture. In the context of the present disclosure, a neural network accelerator is also referred to as a machine learning accelerator or a deep learning accelerator. In various embodiments, the accelerator architecture 200 may also be applied in a memory computing accelerator having various functions, such as an accelerator for parallel graphics processing, for database queries, or for other computing tasks. As shown in fig. 2A, the accelerator architecture 200 includes a memory compute accelerator 210, an interface 212, and the like. It should be appreciated that the in-memory computation accelerator 210 performs algorithmic operations based on the transferred data.
The memory compute accelerator 210 includes one or more memory slices 2024. In some embodiments, memory slice 2024 includes multiple memory blocks for data storage and computation. The memory block is configured to perform one or more operations (e.g., multiply, add, multiply accumulate, etc.) on the transferred data. In some embodiments, each memory block included in memory slice 2024 has the same configuration as in-memory computing block 100 shown in fig. 1. The in-memory computing accelerator 210 can provide versatility and scalability due to the layered design of the in-memory computing accelerator 210. The memory compute accelerator 210 may include any number of memory slices 2024 and each memory slice 2024 may have any number of memory blocks.
An interface 212, such as a peripheral component interconnect express (PCIe) interface, may be used as an inter-chip bus to provide communication between the memory computing accelerator 210 and the host unit 222. The inter-chip bus connects the memory compute accelerator 210 with other devices, such as off-chip memory or peripherals. In some embodiments, the accelerator architecture 200 further includes a Direct Memory Access (DMA) unit. The direct memory access unit may be considered part of the interface 212 or a separate component (not shown) in the memory compute accelerator 210 that facilitates transferring data between the host memory 224 and the memory compute accelerator 210. In addition, the direct memory access unit can assist in transferring data between the plurality of accelerators. The direct memory access unit allows an off-chip device to access on-chip and off-chip memory without causing a Central Processing Unit (CPU) interrupt of the host. Thus, the direct memory access unit may also generate a memory address and initiate a memory read or write cycle. The direct memory access unit may also contain a plurality of hardware registers that can be read and written to by one or more processors. The plurality of hardware registers include a memory address register, a byte count register, one or more control registers, and other types of registers that can specify some combination of source, destination, transfer direction (reading from or writing to an input/output (I/O) device), transfer unit size, or number of bytes transferred in a burst. It should be understood that the accelerator architecture 200 may include a second direct memory access unit for transferring data between other accelerator architectures to allow multiple accelerator architectures to communicate directly without involving a host central processor.
While the accelerator architecture 200 of FIG. 2A is explained as including an in-memory computation accelerator 210 having memory blocks (e.g., the in-memory computation block 100 of FIG. 1), it should be understood that the disclosed embodiments may be applied to any type of memory block that supports arithmetic operations to accelerate certain applications (e.g., deep learning).
The accelerator architecture 200 may also communicate with a host unit 222. The host unit 222 may be one or more processing units (e.g., an X86 central processing unit). In some embodiments, the in-memory computation accelerator 210 is considered a co-processor of the host unit 222.
As shown in FIG. 2A, the host unit 222 may be associated with a host memory 224. In some embodiments, the host memory 224 is an integrated memory or an external memory associated with the host unit 222. The host memory 224 may be a local or global memory. In some embodiments, the host memory 224 comprises a host disk. The host disk is configured as external memory to provide additional storage for the host unit 222. The host memory 224 may be a double data rate synchronous dynamic random-access memory (DDR SDRAM) or the like. The host memory 224 is configured to store large amounts of data at slower access speeds and acts as a higher-level cache relative to the on-chip memory of the in-memory compute accelerator 210. Data stored in the host memory 224 may be transferred to the in-memory compute accelerator 210 for various computational tasks or to execute neural network models.
In some embodiments, host system 220, having host unit 222 and host memory 224, includes a compiler (not shown). A compiler is a program or computer software that converts computer code written in a programming language into instructions to create an executable program. In a machine learning application, a compiler may perform various operations such as preprocessing, lexical analysis, syntax parsing, semantic analysis, converting an input program into an intermediate representation, code optimization and code generation, or a combination thereof.
In some embodiments, the compiler pushes one or more commands to the host unit 222. Based on these commands, host unit 222 may assign any number of tasks to one or more memory slices (e.g., memory slice 2024) or processing elements. Some commands may instruct the direct memory access unit to load instructions and data from a host memory (e.g., host memory 224 of fig. 2A) into an accelerator (e.g., in-memory computation accelerator 210 of fig. 2A). Instructions may be loaded into each memory slice (e.g., memory slice 2024 of fig. 2A) assigned a corresponding task, and one or more memory slices may process the instructions.
It should be appreciated that the first few instructions may indicate loading/storing data from the host memory 224 into one or more local memories of the memory slice. Each memory slice may then launch an instruction pipeline, which involves fetching the instruction from local memory (e.g., by a fetch unit), decoding the instruction (e.g., by an instruction decoder), generating local memory addresses (e.g., corresponding to operands), reading source data, performing computation or load/store operations, and then writing back the result.
Fig. 2B illustrates a schematic diagram of an exemplary cloud system including a neural network accelerator architecture, consistent with some embodiments of the present disclosure. As shown in fig. 2B, the cloud system 230 provides a cloud service with artificial intelligence functionality and includes a plurality of compute servers (e.g., compute servers 232 and 234). In some embodiments, the compute server 232 may, for example, incorporate the accelerator architecture 200 of FIG. 2A. For simplicity and clarity, the accelerator architecture 200 is shown in a simplified manner in fig. 2B.
With the assistance of the accelerator architecture 200, the cloud system 230 can provide extended data processing functionality. For example, in some embodiments, the cloud system 230 can provide artificial intelligence functions such as image recognition, facial recognition, translation, 3D modeling, and so forth. It should be understood that the accelerator architecture 200 may be deployed to a computing device in other forms. For example, the accelerator architecture 200 may also be integrated in computing devices, such as smartphones, tablets, and wearable devices.
FIG. 3 illustrates a structure of an exemplary memory slice consistent with certain embodiments of the present disclosure. Memory slice 300 includes a storage block component 310, a controller 320, a row driver 331, a column driver 332, a global buffer 340, an instruction memory 350, a data transfer table 360, and a block table 370. According to some embodiments of the present disclosure, storage block component 310 includes a plurality of storage blocks arranged in a two-dimensional grid.
Controller 320 provides commands to each memory block in memory block assembly 310 through row drivers 331, column drivers 332, and global buffer 340. A row driver 331 is connected to each row of memory blocks in memory block assembly 310, and a column driver 332 is connected to each column of memory blocks in memory block assembly 310. In some embodiments, the block controller (e.g., block controller 120 of fig. 1) included in each memory block is configured to receive commands from the controller 320 through the row driver 331 or the column driver 332 and to issue signals to the block row driver (e.g., block row driver 131 of fig. 1) and the block column driver (e.g., block column driver 132 of fig. 1) to perform corresponding operations in the memory. According to embodiments of the present disclosure, by using the block controller in each memory block of the memory block assembly 310, the memory blocks can independently perform different operations, enabling block-level parallel processing, and data can be efficiently transferred between memory cells arranged in rows and columns within the corresponding memory block through the block controller.
In some embodiments, global buffer 340 is used to transfer data between storage blocks in storage block component 310. For example, controller 320 uses global buffer 340 when transferring data from one memory block to another memory block in memory block component 310. According to some embodiments of the present disclosure, global buffer 340 is shared by all memory blocks in memory block component 310. Global buffer 340 may be configured to store commands per memory block to handle tasks assigned in processing the neural network model. In some embodiments, the controller 320 is configured to send commands stored in the global buffer 340 to the corresponding memory blocks through the row driver 331 and the column driver 332. In some embodiments, these commands are transmitted from a host unit (e.g., host unit 222 of FIG. 2A). Global buffer 340 may be configured to store data for processing the assigned tasks and to send the data to the memory blocks. In some embodiments, data stored in the global buffer 340 and sent from the global buffer 340 is transferred from a host unit (e.g., the host unit 222 of FIG. 2A) or other storage block in the storage block component 310. In some embodiments, controller 320 is configured to store data from storage blocks in storage block component 310 into global buffer 340. In some embodiments, controller 320 receives and stores an entire row of data of one of the memory blocks in memory block component 310 into global buffer 340 in one cycle. Similarly, controller 320 sends an entire row of data from global buffer 340 to another memory block in one cycle.
In some embodiments, the memory slice 300 of fig. 3 includes an instruction memory 350 configured to store instructions for executing the neural network model in the storage block component 310 in a pipelined manner. The instruction memory 350 may store compute instructions or instructions for data movement between storage blocks of the storage block component 310. The controller 320 may be configured to access the instruction memory 350 to retrieve the instructions stored there. The instruction memory 350 may be configured to have a separate instruction segment assigned to each memory block. In some embodiments, the memory slice 300 includes a data transfer table 360 for recording data transfers in the memory slice 300. The data transfer table 360 may be configured to record data transfers between memory blocks. In some embodiments, the data transfer table 360 may be configured to record pending data transfers. In some embodiments, the memory slice 300 may include a block table 370 for recording the states of the memory blocks. The block table 370 may have a State field storing the current state of the corresponding memory block. In accordance with some embodiments of the present disclosure, during execution of a computation, a storage block of storage block component 310 may be in one of several states including, for example, an idle state, a compute state, and a ready state.
Fig. 4 illustrates an exemplary in-memory computing processing unit 400 consistent with some embodiments of the present disclosure. In some embodiments, the in-memory computing processing unit 400 is applied to an architecture that is the same as or similar to the accelerator architecture 200 shown in FIG. 2A and the memory slice configuration (e.g., memory slice 300) shown in FIG. 3. In some embodiments, the in-memory computing processing unit 400 is referred to as an in-memory computing data processing unit (PIM-DPU). The in-memory computing processing unit 400 includes a memory array 410, a memory interface 420, computing circuitry 430, a host interface 440, configuration registers 450, and a controller 460, which may be integrated onto the same chip or the same die, or embedded in the same package. For example, in some embodiments, the in-memory computing processing unit 400 is on a dynamic random access memory die, where the storage device is a dynamic random access memory or an embedded dynamic random access memory having a memory array 410, the memory array 410 including memory cells arranged in rows and columns. Further, the memory array 410 may be divided into a plurality of logical blocks or partitions, also referred to as "chunks", for storing data, each chunk including one or more rows of the memory array 410. For example, the in-memory computing processing unit 400 includes a 4-Gbit dynamic random access memory, but the disclosure is not limited thereto. The in-memory computing processing unit 400 may also include dynamic random access memory units of various capacities. In some embodiments, the in-memory computing processing unit 400 with dynamic random access memory or embedded dynamic random access memory cells is referred to as a DRAM-based or embedded-DRAM-based accelerator.
An external agent, such as a host, may be communicatively coupled to the in-memory computing processing unit 400 through a peripheral interface (e.g., host interface 440). Commands, instructions, or data are exchanged over the peripheral interface, and the configuration registers 450 are programmed through it to configure various parameters for performing computations. For example, the host interface 440 may be a peripheral component interconnect express (PCIe) interface, although the disclosure is not so limited. The configuration registers 450 may store a configuration including parameters such as the value of K for top-K sorting calculations, the partition block size of a memory block in the memory array 410 of the dynamic random access memory, and the like.
The controller 460 may communicate with the configuration registers 450 to access the stored parameters and accordingly instruct the computing circuitry 430 to perform a series of operations for various computations, such as top-k sorting computations, k-means clustering computations, or other computations for accelerating similarity search methods on large data sets. For example, in a top-K sorting computation, the computing circuitry 430 computes and outputs the first through Kth maximum or minimum values in the data set, where K may be any integer. In a K-means clustering computation, the computing circuitry 430 may divide N data points (or "observations") into K sets (or "clusters") so as to minimize the within-cluster sum of squares (WCSS), i.e., the variance. K may be any integer greater than 1, and N may be any integer greater than K. In some applications, the value of K for top-K sorting or K-means computation is a number between 64 and 1500, but the disclosure is not so limited. In some embodiments, the number of vectors in the data set used for top-k sorting or k-means computation is on the order of 10^8 to 10^10, but the disclosure is not limited thereto.
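For reference, the WCSS objective minimized by the k-means mode can be written in the standard textbook form (not quoted from the patent):

\[
\underset{S}{\arg\min} \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x,
\]

where S = {S_1, ..., S_K} is the partition of the N data points into K clusters and \mu_i is the mean of cluster S_i.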
In some embodiments, instructions from the controller 460 are decoded and then executed by the computing circuitry 430. The computing circuitry 430 includes storage components and computing components for performing computations. For example, the storage components in the computing circuitry 430 include one or more vector registers 432 and one or more scalar registers 436, and the computing components in the computing circuitry 430 include one or more reducers 434 and a scalar Arithmetic Logic Unit (ALU) 438. The computing circuitry 430 may read data from, or write data to, the memory array 410 via the memory interface 420. For example, the memory interface 420 may be a wide input/output interface connecting the memory array 410 and the computing circuitry 430, providing 1024 bits of read/write per cycle (e.g., 2 nanoseconds).
Referring now to fig. 5, exemplary operations performed by the in-memory computing processing unit 400 for a top-k ordering method are shown, consistent with certain embodiments of the present disclosure. In the embodiment shown in FIG. 5, the in-memory computation processing unit 400 performs a top-K ordering method, where the value of K is greater than the number of elements that the vector can accommodate. For example, if data is stored in 32 bits, a 1024-bit vector can accommodate 32 data elements. In response to the value of K programmed in configuration register 450 being greater than 32, controller 460 may instruct computational circuitry 430 to perform the method illustrated in fig. 5.
In this scenario, the memory array 410 includes a plurality of logical blocks 412, 414, and 416 that store data elements. The calculation circuit 430 receives the data elements from the logic blocks 412, 414, and 416 and further calculates a block maximum or minimum element for each of the logic blocks 412, 414, and 416.
For example, the calculation circuit 430 reads a vector (e.g., a 1024-bit vector storing a plurality of data elements) and compares the minimum value stored in the vector with a scalar value of the current minimum value in the current logical block. By repeating the above process and reading each vector in the current logical block, the computation circuit 430 can obtain the minimum element in the current logical block and the block Identification (ID) associated with the minimum element of the block. The computation circuitry 430 may also perform similar operations to obtain the largest element in the current logical block and the block identification associated with the largest element of the block.
The computational circuitry 430 may store the minimum (or maximum) element of each of the logic blocks 412, 414, and 416 in one entry of one of the plurality of vector registers 432. The computational circuitry 430 may then use the reducer 434 to determine a global minimum (or maximum) element based on the block minimum (or maximum) elements of the logical blocks 412, 414, and 416. For example, a minimum reducer is used to determine the global minimum element, and a maximum reducer is used to determine the global maximum element.
After storing the global minimum (or maximum) element as one of the first K data elements, the computation circuitry 430 disables the stored global minimum (or maximum) element and repeats the above operations to obtain a new chunk minimum (or maximum) element for the logical chunk associated with the stored, disabled global minimum (or maximum) element.
After obtaining a new block minimum (or maximum) element for the logical block, the computational circuitry 430 may again use the reducer 434 to determine a second global minimum (or maximum) element based on the block minimum (or maximum) elements of the logical blocks 412, 414, and 416. Thus, to determine the first K data elements (which may be the largest K data elements or the smallest K data elements), the computation circuit 430 may repeat the above operations K cycles to obtain the first through kth global minimum (or maximum) elements.
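Read as an algorithm, this first sorting mode does one reduction per output element. The sketch below is one way to model it in software (an illustrative reading, not the patent's implementation; NumPy arrays stand in for the logical blocks, the block_min array for vector register 432, argmin for the minimum reducer 434, and np.inf for a disabled element):

```python
import numpy as np

def top_k_first_mode(blocks, k):
    """Smallest-k variant of the first sorting mode (hypothetical model)."""
    blocks = [np.asarray(b, dtype=float).copy() for b in blocks]
    block_min = np.array([b.min() for b in blocks])  # per-block minima
    result = []
    for _ in range(k):                     # one cycle per output element
        bid = int(block_min.argmin())      # global minimum via min-reducer
        idx = int(blocks[bid].argmin())
        result.append(blocks[bid][idx])
        blocks[bid][idx] = np.inf          # disable the element just output
        block_min[bid] = blocks[bid].min() # rescan only the affected block
    return result
```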
Referring now to fig. 6, exemplary operations performed by an in-memory computing processing unit for another top-k sorting method are illustrated, consistent with certain embodiments of the present disclosure. In contrast to the embodiment shown in fig. 5, in the embodiment shown in fig. 6 the in-memory computing processing unit 400 performs a top-K sorting method where the value of K is less than or equal to the number of elements a vector can accommodate. For example, if data elements are stored as 32-bit values in a 1024-bit vector, then in response to the value of K programmed in the configuration register 450 being less than or equal to 32, the controller 460 may instruct the computing circuitry 430 to perform the method shown in fig. 6.
In this scenario, vector register 432 is configured to store the current minimum K values. The calculation circuit 430 uses a reducer 434 (e.g., a max reducer) to obtain the maximum of the current minimum K values and stores the maximum in a scalar register 436.
When the computing circuitry 430 reads a vector (e.g., a 1024-bit vector storing a plurality of data elements) from the memory array 410, the computing circuitry 430 stores one or more minimum values of the vector in the scalar register 436. The scalar arithmetic logic unit 438 may communicate with the scalar register 436 and compare the one or more minimum values in the vector with the maximum of the current minimum K values. In response to one or more of the minimum values in the vector being less than the maximum of the current minimum K values, the computing circuitry 430 may replace the maximum of the current minimum K values in the vector register 432 with the one or more minimum values from the vector, and then recompute a new maximum of the current minimum K values in the vector register 432.
The calculation circuit 430 may repeatedly perform the above operations until all data elements are read and processed. Thus, the K values retained in vector register 432 after this iterative process are the smallest K data elements.
Computing circuitry 430 may perform similar operations to store the largest K data elements in vector registers 432 by storing the currently largest K values in vector registers 432, comparing one or more maximum values of a vector read from memory array 410 with the minimum value of the currently largest K values stored in scalar registers 436 by scalar arithmetic logic unit 438, and updating vector registers 432 based on the comparison results. Thus, the computational circuitry 430 may determine the first K data elements, which may be the largest K data elements or the smallest K data elements.
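A software model of this second sorting mode might look as follows (a sketch under stated assumptions: 'current' plays the role of vector register 432 holding the running smallest K values, 'worst' the scalar register 436 holding their maximum, and the inner comparison the scalar ALU 438):

```python
import numpy as np

def top_k_second_mode(memory_vectors, k):
    """Smallest-k variant of the second sorting mode (hypothetical model).

    memory_vectors: iterable of 1-D arrays, each modeling one wide
    (e.g., 1024-bit) vector read from the memory array.
    """
    current = np.full(k, np.inf)   # running smallest-K values
    worst = np.inf                 # maximum of the running smallest-K
    for vec in memory_vectors:
        for v in vec:
            if v < worst:                        # scalar ALU comparison
                current[current.argmax()] = v    # replace current maximum
                worst = current.max()            # max-reducer recomputes worst
    return np.sort(current)
```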
Referring now to fig. 7A and 7B, an exemplary in-memory computing processing unit 700 is shown, consistent with some embodiments of the present disclosure. Similar to the in-memory compute processing unit 400 in fig. 4, in some embodiments, the in-memory compute processing unit 700 may be applied to an architecture that is the same as or similar to the accelerator architecture 200 shown in fig. 2A and the memory slice configuration (e.g., memory slice 300) shown in fig. 3. In some embodiments, in comparison to in-memory computation processing unit 400 in fig. 4, in-memory computation processing unit 700 includes more memory components and computation components to perform computations. For example, the computational circuitry 430 in the in-memory computational processing unit 700 may further include a static random access memory 732, a decoder 734, and a single instruction multiple data processor 736, the single instruction multiple data processor 736 including one or more adders, subtractors, multipliers, multiply accumulators, or any combination thereof.
Fig. 7B illustrates how the storage and computing components in the in-memory computing processing unit 700 communicate and cooperate to perform various computing tasks. In some embodiments, in-memory compute processing unit 700 uses sram 732 and decoder 734 to perform a Product Quantization (PQ) compression method to compress or reconstruct data received from memory array 410 or controller 460 for later data processing or operation in compute circuit 430.
In some embodiments, the computing circuitry 430 performs a similarity search or a k-means algorithm and computes the distance between two vectors in a highly parallel, scalable manner using the single instruction multiple data processor 736 and the reducer 434. After the calculation, the computing circuitry 430 may store the computed distance values in the vector register 432 or the scalar register 436. The computing circuitry 430 may perform top-k sorting operations through its registers (e.g., vector register 432 or scalar register 436) and a maximum reducer in the reducer 434, based on a heap sort algorithm or any other suitable algorithm. The detailed operation is further discussed in the following paragraphs.
Fig. 8 illustrates an accelerator architecture 800 based on memory computations consistent with some embodiments of the present disclosure. As shown in FIG. 8, in some embodiments, in-memory compute processing units 810a-810n are implemented by in-memory compute processing unit 400 in FIG. 4 or in-memory compute processing unit 700 in FIG. 7A and provide high scalability and high capacity. In the memory computing based accelerator architecture 800 shown in FIG. 8, a memory computing system may include a plurality of memory computing processing units 810a-810n, and each of the memory computing processing units 810a-810n communicates with a host 820 via a handshake protocol, a Double Data Rate (DDR) protocol, or any other suitable protocol.
In some embodiments, the in-memory computing processing units 810a-810n do not have direct communication capabilities. Alternatively, the in-memory computing processing units 810a-810n communicate only with the host 820, although the disclosure is not so limited. In some other embodiments, some or all of the in-memory computing processing units 810a-810n may also communicate directly with one or more of the in-memory computing processing units 810a-810n via an appropriate protocol. In some embodiments, the memory computing based accelerator architecture 800 includes hundreds or thousands of memory computing processing units 810a-810n depending on different capacity requirements in various applications. In general, the in-memory compute processing units 810a-810n in FIG. 8 may handle and process many different types of highly parallel computations and send the final computation results to the host 820. Thus, data communication between the in-memory computing chip and the host 820 is reduced.
Reference is now made to fig. 9, which illustrates exemplary operations performed by an in-memory computing processing unit for asynchronous computation of a similarity search, consistent with certain embodiments of the present disclosure. As shown in fig. 9, the memory array 410 may include four dynamic random access memory blocks. In some embodiments, during the similarity search, the in-memory computing processing unit 700 computes distance values between vectors stored in the DRAM blocks and performs a top-k sorting method to sort the computed distance values.
As described above, the calculation circuit 430 can calculate the distance value between vectors in a highly parallel calculation manner by the sum reducer in the single instruction multiple data processor 736 and the reducer 434. The parallel computation output from the single instruction multiple data processor 736 may be stored or accumulated in a vector accumulator 738.
For the asynchronous computation in FIG. 9, the distance values accumulated or stored in the vector accumulator 738 may then be written back to one of the DRAM blocks in the memory array 410. The computing circuitry 430 may then access the distance values in the memory array 410 and perform a top-k sorting operation through its registers (e.g., vector register 432 or scalar register 436) and the reducer 434 (e.g., a minimum reducer), based on a merge sort algorithm or any other suitable algorithm. The minimum value of each dynamic random access memory block and the corresponding tag may be stored in a register of the computing circuitry 430. Thus, the computing circuitry 430 may first use a minimum reducer in the reducer 434 to find the minimum value stored in the registers, then look up the corresponding minimum value in the dynamic random access memory block storing the distance values and output it to the host. That is, in the asynchronous computation of FIG. 9, the computation of the distance values and the top-k sorting operation are performed in different time periods.
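End to end, the asynchronous flow might be modeled as below (a sketch; NumPy stands in for the SIMD processor 736, the reducers, and the write-back to a DRAM block, and squared Euclidean distance is assumed as the similarity metric):

```python
import numpy as np

def asynchronous_similarity_search(query, database, k):
    """Two-phase model: distances are materialized first, then top-k runs.

    Assumes k < len(database); database has shape (N, dims).
    """
    # Phase 1: parallel distance computation, results "written back"
    distances = ((database - query) ** 2).sum(axis=1)  # squared L2 per vector
    # Phase 2: separate top-k pass over the stored distance values
    nearest = np.argpartition(distances, k)[:k]        # k smallest, unordered
    return nearest[np.argsort(distances[nearest])]     # ids, nearest first
```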
Referring now to fig. 10, exemplary operations performed by the in-memory computing processing unit 700 for performing a synchronous computation of a similarity search are illustrated, consistent with some embodiments of the present disclosure. In contrast to the asynchronous calculation in FIG. 9, for the synchronous calculation shown in FIG. 10, the distance value may be stored in a register and not written back to the memory array 410.
The computing circuitry 430 performs top-k sorting operations through its registers (e.g., vector register 432 or scalar register 436) and the reducer 434 (e.g., a maximum reducer) based on a heap sort algorithm. In particular, the computing circuitry 430 may use a maximum reducer in the reducer 434 to maintain a min top-k heap, with a register storing the heap. Thus, in the synchronous computation of FIG. 10, the computation of the distance values and the top-k sorting operation are performed simultaneously and synchronously within the computing circuitry 430.
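A streaming software analogue of the synchronous mode follows (a sketch; Python's heapq, with negated keys to emulate a max-heap, replaces the register-plus-maximum-reducer structure):

```python
import heapq

def synchronous_similarity_search(distance_stream, k):
    """Keep the k nearest while distances are still being produced.

    distance_stream yields (vector_id, distance) pairs as each distance
    is computed; nothing is written back to the memory array.
    """
    heap = []  # stores (-distance, id): the root is the worst kept entry
    for vid, dist in distance_stream:
        if len(heap) < k:
            heapq.heappush(heap, (-dist, vid))
        elif dist < -heap[0][0]:            # better than the current worst
            heapq.heapreplace(heap, (-dist, vid))
    return sorted((-neg, vid) for neg, vid in heap)  # (distance, id) pairs
```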
Reference is now made to fig. 11 and 12, which illustrate exemplary operations performed by the in-memory computing processing unit 700 for k-means clustering computations, consistent with embodiments of the present disclosure. k-means clustering is a vector quantization method that aims to divide n observations into k sets (e.g., clusters). Each observation is a vector and belongs to the cluster with the closest mean. That is, each observation is assigned to the cluster whose cluster center (i.e., cluster centroid) is closest.
Given an initial set of k means, the k-means clustering computation proceeds by alternating between an assignment step and an update step. Fig. 11 illustrates exemplary operations of the assignment step, in which the in-memory computing processing unit 700 assigns each vector to the cluster with the closest mean (e.g., the cluster with the smallest squared Euclidean distance). Fig. 12 illustrates exemplary operations of the update step, in which the in-memory computing processing unit 700 recalculates the mean or centroid (i.e., a data point, imaginary or actually existing, at the cluster center) for the vectors assigned to each cluster. For example, in some embodiments, the centroid of a cluster may be calculated and defined based on the following formula:
\[
m_i^{(t+1)} = \frac{1}{\bigl|S_i^{(t)}\bigr|} \sum_{x_j \in S_i^{(t)}} x_j
\]

where x_1 to x_n are the n vectors to be clustered, and m_1^{(t)} to m_k^{(t)} respectively denote the centroids of the k clusters S_1^{(t)} to S_k^{(t)} in the t-th iteration.
As shown in fig. 11, the memory array 410 stores vectors in its memory blocks 1010 and stores the current centroids (means) of the clusters in one of its row buffers 1020. In some embodiments, when a row buffer 1020 is allocated to store the current centroids, the associated memory block is not allocated to store vectors, so the vectors stored in the memory array 410 can still be read through the respective row buffers. The computing circuitry 430 may read the feature vectors in the memory array 410 and all the centroids in the row buffer 1020 into the registers 432 or 436. The computing circuitry 430 may then compute the distance values between the feature vectors and the centroids in a highly parallel manner through the single instruction multiple data processor 736 and a sum reducer in the reducer 434, and store the computed distances between the feature vectors and the centroids in the static random access memory 732.
The computing circuitry 430 may then use a minimum reducer in the reducer 434 to find the minimum value stored in the static random access memory 732, so as to assign the feature vector to the cluster with the closest mean. Accordingly, the computing circuitry 430 may tag the feature vector with a cluster identification indicating which centroid is closest to it, and write the feature vector with the cluster identification back to the memory array 410.
As shown in FIG. 12, in the update step, the computing circuitry 430 reads the vectors tagged with cluster identifications in the memory array 410 into the registers 432 or 436 and computes an updated centroid from the one or more vectors tagged with the corresponding cluster identification (e.g., the same cluster identification) through the single instruction multiple data processor 736. The updated centroid may then be written back into the row buffer 1020. In the update step, random memory accesses in the k-means clustering computation can be reduced by changing the access order of the vectors and centroids.
By repeating the assignment and update steps, the computing circuitry 430 may cluster the vectors stored in the memory array 410 until convergence is reached. In response to the centroids not changing after the update step (e.g., no vector being assigned to a different cluster in the assignment step), the computing circuitry 430 may output the clustering results to the host or store them in the memory array 410.
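Putting the two steps together, the assign/update iteration of FIGS. 11 and 12 can be sketched as follows, reusing the assign_step sketch (and NumPy import) above. The convergence test mirrors the "centroids not changing" condition; the empty-cluster handling is an assumption the disclosure does not spell out:

```python
def kmeans(vectors, centroids, max_iters=100):
    """Alternate the assignment and update steps until no centroid changes."""
    for _ in range(max_iters):
        labels = assign_step(vectors, centroids)
        # Update step: each centroid becomes the mean of the vectors
        # carrying its cluster identification.
        new_centroids = np.array([
            vectors[labels == c].mean(axis=0) if (labels == c).any()
            else centroids[c]  # assumption: an empty cluster keeps its centroid
            for c in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):  # convergence reached
            break
        centroids = new_centroids
    return labels, centroids
```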
FIG. 13 illustrates a flowchart of an exemplary method 1300 of performing data processing on an in-memory-computing-based accelerator architecture, consistent with some embodiments of the present disclosure. According to some embodiments of the present disclosure, an in-memory-computing-based accelerator architecture (e.g., the accelerator architecture 200 in FIG. 2A, the memory slice 300 in FIG. 3, or the in-memory-computing-based accelerator architecture 800 in FIG. 8) is used to perform top-K sorting, k-means clustering, or similarity search. In particular, any of the in-memory compute processing units (e.g., the in-memory compute processing units 810a-810n in FIG. 8) in the accelerator architecture may select between its computation modes based on a configuration from a host (e.g., the host 820 in FIG. 8) to which the in-memory compute processing unit is communicatively coupled. The computation modes may include one or more of a first top-K sorting mode, a second top-K sorting mode, and a k-means clustering mode. Depending on the selected mode, the in-memory compute processing unit accesses the data elements in the memory array to perform the corresponding operations.
The data processing method 1300 in FIG. 13 illustrates the operations of the top-K sorting computation performed by the in-memory-computing-based accelerator architecture when a top-K sorting mode is selected. At step 1310, an in-memory compute processing unit (e.g., one of the in-memory compute processing units 810a-810n in FIG. 8) receives a configuration from a host (e.g., the host 820 in FIG. 8) communicatively connected to the in-memory compute processing unit. The in-memory compute processing unit may select between the computation modes according to the configuration. In the data processing method 1300, the computation modes include a first top-K sorting mode and a second top-K sorting mode.
At step 1320, the in-memory compute processing unit determines whether to operate in the first top-K sorting mode or the second top-K sorting mode. Specifically, the in-memory compute processing unit compares the value of K in the configuration with a threshold. In response to the value of K in the configuration being greater than the threshold (YES at step 1320), the in-memory compute processing unit selects the first top-K sorting mode and performs steps 1331-1338. In response to the value of K in the configuration being less than or equal to the threshold (NO at step 1320), the in-memory compute processing unit selects the second top-K sorting mode and performs steps 1341-1346.
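For illustration, the dispatch at step 1320 reduces to a single comparison; the threshold is a device-specific configuration parameter not fixed by the disclosure, and the names below are illustrative assumptions only:

```python
def select_topk_mode(k: int, threshold: int) -> str:
    """Step 1320: a large K triggers the block-wise first mode, while a
    small K fits in registers and uses the second mode."""
    return "first top-K sorting mode" if k > threshold else "second top-K sorting mode"
```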
In the first top-K sorting mode, at step 1331, the computing circuitry of the in-memory compute processing unit receives the data elements from a plurality of logical blocks in the memory array.
At step 1332, the computing circuitry computes a block maximum (or minimum) element for each logical block. For example, if the top-K sorting mode is used to determine the K largest data elements, the computing circuitry computes the block maximum elements; conversely, if the mode is used to determine the K smallest data elements, the computing circuitry computes the block minimum elements.
At step 1333, the computing circuitry stores the block maximum (or minimum) elements of the logical blocks in one or more vector registers in the computing circuitry. The computing circuitry then repeats steps 1334-1338 until the top K data elements are determined.
At step 1334, the computing circuitry determines a global maximum (or minimum) element based on the block maximum (or minimum) elements of the logical blocks. At step 1335, the computing circuitry stores the determined global maximum (or minimum) element as one of the top K data elements. After storing the global maximum (or minimum) element, at step 1336, the computing circuitry disables that element in its associated logical block. At step 1337, the computing circuitry obtains the next block maximum (or minimum) element of the logical block associated with the disabled global maximum (or minimum) element.
That is, through steps 1336 and 1337, when a global maximum (or minimum) element is determined and stored, the computing circuitry only needs to recalculate and update a new block maximum (or minimum) element for the corresponding logical block; the block maximum (or minimum) elements of the other logical blocks can be reused in the next iteration to determine the next global maximum (or minimum) element.
At step 1338, the computing circuitry determines whether all of the top K data elements have been determined and obtained. If so (YES at step 1338), the computing circuitry performs step 1350 and outputs the top K data elements to the host or the memory array. Otherwise (NO at step 1338), steps 1334-1338 are repeated to determine and store the top K data elements in order.
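For illustration, steps 1331-1338 can be mimicked in software as a tournament over per-block maxima, with "disabling" an element modeled by removing it from its block. This is a sketch of the control flow only, assuming the top-K largest elements are sought and at least K elements exist across the blocks:

```python
def topk_first_mode(blocks, k):
    """Steps 1331-1338: keep one maximum per logical block, repeatedly
    extract the global maximum, then refresh only the affected block."""
    block_max = [max(b) if b else None for b in blocks]       # steps 1332-1333
    top = []
    while len(top) < k:
        # Step 1334: global maximum over the surviving block maxima.
        j = max((i for i, m in enumerate(block_max) if m is not None),
                key=lambda i: block_max[i])
        top.append(block_max[j])                              # step 1335
        blocks[j].remove(block_max[j])                        # step 1336: disable
        block_max[j] = max(blocks[j]) if blocks[j] else None  # step 1337
    return top

# Example with three logical blocks (mutates the block lists):
print(topk_first_mode([[3, 8], [10, 1], [6, 7]], 4))  # prints [10, 8, 7, 6]
```

Only the block whose maximum was just consumed is re-reduced in each round; the other block maxima are reused, which is the data-reuse point made in the preceding paragraph.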
When the second top-K sorting mode is selected based on the determination at step 1320, the computing circuitry of the in-memory compute processing unit first stores K initial data elements from the memory array into a first register at step 1341. The computing circuitry then updates the first register by repeating steps 1342-1346 until all data elements from the memory array have been received and processed.
At step 1342, the computing circuitry selects the smallest (or largest) element in the first register as the target element. For example, if the top-K sorting mode is used to determine the K largest data elements, the computing circuitry selects the smallest element in the first register as the target element; conversely, if the mode is used to determine the K smallest data elements, the computing circuitry selects the largest element in the first register as the target element. That is, the target element is the element that can be evicted from the first register and replaced by another data element during a subsequent update.
At step 1343, the computing circuitry determines a top-K candidate element from the one or more remaining data elements received from the memory array. For example, the remaining data elements may be new data read from a not-yet-processed vector of the dynamic random access memory data array. The computing circuitry may receive the vector from the memory array and select the largest (or smallest) element in the vector as the top-K candidate element.
At step 1344, the computing circuitry compares the top-K candidate element with the target element. If the top-K sorting mode is used to determine the K largest data elements, the computing circuitry determines whether the candidate element is larger than the target element currently stored in the first register; if the mode is used to determine the K smallest data elements, the computing circuitry determines whether the candidate element is smaller than the target element. The computing circuitry may thus determine, based on the comparison, whether to replace the target element in the first register with the candidate element. In particular, in some embodiments, the computing circuitry stores the target element and the candidate element in a second register in the computing circuitry and compares them by means of a scalar arithmetic logic unit to obtain the comparison result.
If the computing circuitry determines that the target element in the first register should be replaced (YES at step 1344), the computing circuitry performs step 1345 to replace the target element with the candidate element. Otherwise (NO at step 1344), step 1345 is bypassed and the data stored in the first register remain unchanged.
At step 1346, the computing circuitry determines whether all data elements of interest in the memory array have been processed. If data elements remain to be processed (e.g., data in vectors that have not yet been processed) (NO at step 1346), steps 1342-1346 are repeated to update the current top K data elements stored in the first register. Once all data elements of interest in the memory array have been processed (YES at step 1346), the computing circuitry performs step 1350 and outputs the top K data elements stored in the first register to the host or the memory array.
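Steps 1341-1346 can likewise be sketched as a register-resident replacement scan. Here the memory contents are flattened into one element stream for simplicity, which abstracts away the per-vector candidate selection of step 1343; again, the K largest elements are assumed to be sought:

```python
from itertools import islice

def topk_second_mode(data_elements, k):
    """Steps 1341-1346: hold K elements in a 'first register', then replace
    its minimum (the target element) whenever a larger candidate arrives."""
    stream = iter(data_elements)
    first_register = list(islice(stream, k))  # step 1341: K initial elements
    for candidate in stream:                  # remaining data elements
        target = min(first_register)          # step 1342
        if candidate > target:                # steps 1343-1344
            first_register[first_register.index(target)] = candidate  # step 1345
    return sorted(first_register, reverse=True)

print(topk_second_mode([7, 2, 9, 4, 11, 5], 3))  # prints [11, 9, 7]
```

For K values small enough for the top-K set to fit in the first register, this avoids the per-block bookkeeping of the first mode entirely.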
Thus, through the above operations, the data processing method 1300 in FIG. 13 can implement a top-K sorting computation that outputs the K largest (or smallest) data elements, where the value of K may be greater or smaller than the number of elements held by one vector in the vector registers. Specifically, if the value of K is an integer greater than the threshold, the first top-K sorting mode is selected; if the value of K is an integer less than or equal to the threshold, the second top-K sorting mode is selected.
FIG. 14 illustrates a flowchart of another exemplary data processing method 1400 performed on an in-memory-computing-based accelerator architecture, consistent with some embodiments of the present disclosure. The data processing method 1400 in FIG. 14 illustrates the operations of the k-means clustering computation performed by an in-memory-computing-based accelerator architecture (e.g., the accelerator architecture 200 in FIG. 2A, the memory slice 300 in FIG. 3, or the in-memory-computing-based accelerator architecture 800 in FIG. 8) in the k-means clustering mode.
In the k-means clustering mode, at step 1410, at least one in-memory compute processing unit (e.g., one of the in-memory compute processing units 810a-810n in FIG. 8) initializes the centroids of the clusters. For example, to form k clusters, the in-memory compute processing unit may provide k initial centroids and store them in row buffers of the memory array. The in-memory compute processing unit then clusters the plurality of vectors stored in the memory array by repeating the assignment step 1420 and the update step 1430. In the assignment step 1420, the in-memory compute processing unit assigns each vector to one of the current clusters based on the distances between the vector and the centroids of the clusters. In the update step 1430, the in-memory compute processing unit updates the centroid of each cluster based on the corresponding vectors; each centroid is the average of the vectors assigned to the same cluster in the assignment step 1420.
In particular, the assignment step 1420 includes sub-steps 1421-1425. At step 1421, the computing circuitry receives the centroids from the row buffers. At step 1422, the computing circuitry receives a feature vector selected from the vectors in the memory array. At step 1423, the computing circuitry tags the feature vector with a cluster identification, where the cluster identification indicates the centroid closest to the feature vector. At step 1424, the computing circuitry writes the feature vector with the cluster identification back to the memory array.
At step 1425, the computing circuitry determines whether all vectors of interest in the memory array have been assigned to an associated cluster. If vectors remain to be processed (NO at step 1425), steps 1421-1425 are repeated to assign the remaining vectors. Once all vectors of interest in the memory array have been assigned (YES at step 1425), the computing circuitry proceeds to the update step 1430.
The update step 1430 likewise includes sub-steps 1431 and 1432. At step 1431, the computing circuitry computes an updated centroid from the one or more vectors labeled with the corresponding cluster identification. At step 1432, the computing circuitry determines whether the centroids of all cluster identifications have been updated based on the most recent assignments obtained in step 1420. If centroids remain to be updated (NO at step 1432), steps 1431 and 1432 are repeated.
When all centroids have been updated (YES at step 1432), at step 1440 the computing circuitry checks whether none of the centroids changed after step 1430 in the current iteration. If one or more centroids changed (NO at step 1440), the in-memory compute processing unit repeats the assignment step 1420 and the update step 1430 until convergence is reached.
In response to the centroids not changing after the update operation (YES at step 1440), the in-memory compute processing unit performs step 1450 and outputs the clustering results to the host or stores them in the memory array. Thus, through the above operations, the data processing method 1400 in FIG. 14 can implement the k-means clustering computation and output the clustering results.
In summary, as proposed in various embodiments of the present disclosure, the proposed apparatus and methods can take advantage of the high bandwidth of dynamic random access memory and ensure efficient, parallel, and fast computation. With a wide input/output (e.g., 1024-bit read/write per cycle) between the memory array (e.g., a dynamic random access memory data array) and the computing circuitry, the memory performance bottleneck in similarity search and k-means computations is significantly reduced.
Furthermore, by performing k-means clustering computations using the proposed apparatus and methods, unnecessary data movement between the memory array and the computing circuitry is reduced or even minimized. In addition, random memory accesses are reduced by storing the centroids in row buffers and changing the access order of the vectors and centroids in the update step of the k-means clustering computation. Thus, the overall efficiency of k-means clustering is improved. In some embodiments, the efficiency of the similarity search may depend only on the ratio of bandwidth to memory capacity; as the data grow, the computation time of the similarity search can still be kept within several tens of milliseconds.
Embodiments of the present disclosure may be applied to many products, environments, and scenarios. For example, some embodiments of the present disclosure may be applied to in-memory computing, such as artificial intelligence in-memory computing, including dynamic random access memory-based processing units. Some embodiments of the present disclosure may also be applied to tensor processing units, data processing units, neural network processing units, and the like.
Embodiments of the present disclosure also provide a computer program product. The computer program product includes a non-transitory computer readable storage medium having computer readable program instructions thereon for causing a processor to perform the above-described method.
The computer readable storage medium may be a tangible device capable of storing instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. A more specific exemplary, non-exhaustive list of computer-readable storage media includes the following: a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory (EPROM), a static random access memory, a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device (e.g., punch cards or raised structures in grooves having instructions recorded thereon), and any suitable combination of the foregoing.
The computer-readable program instructions for performing the above-described methods may be assembler instructions, instruction-set-architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages and conventional procedural programming languages. The computer-readable program instructions may execute entirely on a computer system as a stand-alone software package, or may execute partially on a first computer and partially on a second computer remote from the first computer. In the latter scenario, the remote second computer may be connected to the first computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN).
The computer-readable program instructions may be provided to a processor of a computer or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the methods described above.
The flowcharts and diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present disclosure. In this regard, a block in a flowchart or diagram may represent a software program, segment, or portion of code that comprises one or more executable instructions for implementing the specified functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowcharts or diagrams, and combinations of blocks in the flowcharts or diagrams, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
It is appreciated that certain features of the disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the disclosure that are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the disclosure. Certain features described in the context of various embodiments should not be considered essential features of those embodiments unless the embodiments are not practicable without such elements.
Embodiments may be further described using the following clauses:
1. An in-memory computing device, comprising:
a memory array configured to store data; and
computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of:
selecting between a plurality of computing modes based on a configuration from a host communicatively coupled to the in-memory computing device, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
outputting the first K data elements of the plurality of data elements to the memory array or the host in the first ordering mode or the second ordering mode;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
2. The in-memory computing device of claim 1, the compute circuitry further comprising a vector register, wherein in the first ordering mode, the compute circuitry receives the plurality of data elements from a plurality of logic blocks in the memory array, and wherein the compute circuitry is further configured to execute the set of instructions to cause the in-memory computing device to determine the first K data elements by:
calculating a block maximum or minimum element for each of the plurality of logical blocks;
storing the block maximum or minimum elements of the plurality of logical blocks in the vector register; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on a block maximum or minimum element of the plurality of logical blocks;
storing the global maximum or minimum element as one of the first K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
3. The in-memory computing device of claim 1 or 2, the computing circuitry comprising a first register, wherein in the second ordering mode, the computing circuitry is further configured to execute the instructions to cause the in-memory computing device to determine the top K data elements by:
storing a plurality of initial data elements from the memory array to the first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from among the one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
4. The in-memory computing device of claim 3, the computing circuitry comprising a second register and a scalar arithmetic logic unit, wherein the computing circuitry is further configured to execute the instructions to cause the in-memory computing device to determine whether to replace the target element in the first register with the candidate element by:
storing said target element and said candidate element in said second register; and
comparing, by the scalar arithmetic logic unit, the candidate element with the target element.
5. The in-memory computing device of any of claims 1-4, wherein the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to:
selecting the first ordering mode in response to a value of K in the configuration being greater than the threshold; and
selecting the second ordering mode in response to the value of K in the configuration being less than or equal to the threshold.
6. The in-memory computing device of any of claims 1-5, wherein the plurality of computing modes comprises a k-means clustering mode, wherein in the k-means clustering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to:
clustering a plurality of vectors stored in the memory array by repeating the steps of:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
7. The in-memory computing device of claim 6, wherein, in the k-means clustering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to assign each of the plurality of vectors by:
receiving the plurality of centroids from a row buffer of the memory array;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
tagging a cluster identification to the feature vector, the cluster identification indicating a closest one of the plurality of centroids to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
8. The in-memory computing device of claim 7, wherein, in the k-means clustering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to update each of the plurality of centroids by:
an updated centroid is calculated based on one or more vectors in the memory array having corresponding cluster identifications.
9. The in-memory computing device of any of claims 6-8, wherein, in k-means clustering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to:
in response to the plurality of centroids not changing after updating each of the plurality of centroids, outputting a clustering result to the host or the memory array.
10. The in-memory computing device of any of claims 1-9, further comprising:
a host interface configured to enable the in-memory computing device to communicate with the host;
a configuration register configured to store the configuration from the host; and
a controller configured to send the set of instructions according to the configuration.
11. The in-memory computing device of any of claims 1-10, wherein the computing circuitry further comprises:
one or more registers configured to store data for computation;
a storage device configured to store the set of instructions; and
a decoder configured to decode the set of instructions.
12. The in-memory computing device of any of claims 1-11, wherein the computing circuitry further comprises one or more single instruction multiple data units, one or more reducer units, one or more arithmetic logic units, or any combination thereof.
13. The in-memory computing device of any of claims 1-12, wherein the memory array comprises a dynamic random access memory array, the in-memory computing device further comprising an input/output interface for communication between the dynamic random access memory array and the computing circuitry.
14. A method of data processing, comprising:
selecting between a plurality of computing modes based on a configuration of an in-memory computing device, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
in the first ordering mode or the second ordering mode, outputting the first K data elements of the plurality of data elements to the memory array or a host communicatively coupled with the in-memory computing device;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
15. The data processing method of claim 14, further comprising:
in a case where the in-memory computing device is configured to operate in the first ordering mode:
receiving the plurality of data elements from a plurality of logical blocks in the memory array; and
determining the first K data elements by:
calculating and storing a block maximum or minimum element for each of the plurality of logical blocks; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements of the plurality of logical blocks;
storing the global maximum or minimum element as one of the top K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
16. The data processing method of claim 14 or 15, further comprising:
in a case where the in-memory computing device is configured to operate in the second ordering mode, determining the top K data elements by:
storing a plurality of initial data elements from the memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from the one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
17. The data processing method of claim 16, wherein determining whether to replace the target element in the first register with the candidate element comprises:
storing the target element and the candidate element in a second register; and
comparing the candidate element to the target element.
18. The data processing method of any of claims 14-17, wherein the plurality of computation patterns further includes a k-means clustering pattern, the data processing method further comprising:
in a case where the in-memory computing device is configured to operate in the k-means clustering mode, clustering a plurality of vectors stored in the memory array by repeating:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
19. The data processing method of claim 18, wherein assigning each of the plurality of vectors comprises:
receiving the plurality of centroids;
receiving a feature vector selected from the plurality of vectors;
tagging a cluster identification to the feature vector, the cluster identification indicating a closest one of the plurality of centroids to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
20. The data processing method of claim 19, further comprising:
updating each of the plurality of centroids by calculating an updated centroid from one or more vectors of the memory array having respective cluster identifications, if the in-memory computing device is configured to operate in a k-means clustering mode.
21. The data processing method of any of claims 18-20, further comprising:
in a case that the in-memory computing device is configured to operate in a k-means clustering mode, in response to the plurality of centroids not changing after updating each of the plurality of centroids, outputting a clustering result to the host or the memory array.
22. A non-transitory computer-readable medium storing a set of instructions that are executable by one or more computing circuits of an apparatus to cause the apparatus to perform a data processing method, the data processing method comprising:
selecting between a plurality of computing modes based on a configuration, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the device; and
in the first ordering mode or the second ordering mode, outputting the first K data elements of the plurality of data elements to the memory array or a host communicatively connected to the apparatus;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
23. The non-transitory computer-readable medium of claim 22, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to further perform, in the first ordering mode:
receiving the plurality of data elements from a plurality of logic blocks in the memory array; and
determining the first K data elements by:
calculating and storing a block maximum or minimum element for each of the plurality of logical blocks; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements of the plurality of logical blocks;
storing the global maximum or minimum element as one of the first K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
24. The non-transitory computer-readable medium of claim 22 or 23, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to further perform, in the second ordering mode:
determining the first K data elements by:
storing a plurality of initial data elements from a memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from the one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
25. The non-transitory computer-readable medium of claim 24, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to determine whether to replace the target element in the first register with the candidate element by:
storing the target element and the candidate element in a second register; and
comparing the candidate element to the target element.
26. The non-transitory computer-readable medium of any one of claims 22-25, wherein the plurality of computing modes includes a k-means clustering mode, and the set of instructions are executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
clustering a plurality of vectors stored in the memory array by repeating the steps of:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
27. The non-transitory computer-readable medium of claim 26, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to allocate each of the plurality of vectors by:
receiving the plurality of centroids from a row buffer of the memory array;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
tagging a cluster identification to the feature vector, the cluster identification indicating a closest one of the plurality of centroids to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
28. The non-transitory computer-readable medium of claim 27, wherein the set of instructions are executable by the one or more computing circuits of the apparatus to cause the apparatus to update each of the plurality of centroids by:
an updated centroid is calculated based on one or more vectors in the memory array having corresponding cluster identifications.
29. The non-transitory computer-readable medium of any one of claims 26-28, wherein the set of instructions are executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
in response to the plurality of centroids not changing after updating each of the plurality of centroids, outputting a clustering result to the host or the memory array.
30. A data processing system comprising:
a host; and
a plurality of in-memory computing devices communicatively coupled to the host, wherein any of the plurality of in-memory computing devices includes a memory array configured to store data and computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of:
selecting between a plurality of computing modes based on a configuration from the host, the plurality of computing modes including a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
outputting the first K data elements of the plurality of data elements to the host in the first ordering mode or the second ordering mode;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
In the foregoing specification, embodiments have been described with reference to numerous specific details that may vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments may be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only. The order of steps shown in the figures is also intended for illustrative purposes only and is not intended to be limited to any particular order of steps. Thus, those skilled in the art will appreciate that these steps can be performed in a different order to perform the same method. In the drawings and specification, there have been disclosed exemplary embodiments. However, many variations and modifications may be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation. The scope of the embodiments is defined by the appended claims.

Claims (22)

1. An in-memory computing device, comprising:
a memory array configured to store data; and
computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of:
selecting between a plurality of computing modes based on a configuration from a host communicatively coupled to the in-memory computing device, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
outputting the first K data elements of the plurality of data elements to the memory array or the host in the first ordering mode or the second ordering mode;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
2. The in-memory computing device of claim 1, the compute circuitry further comprising a vector register, wherein in the first ordering mode, the compute circuitry receives the plurality of data elements from a plurality of logic blocks in the memory array, and wherein the compute circuitry is further configured to execute the set of instructions to cause the in-memory computing device to determine the first K data elements by:
calculating a block maximum or minimum element for each of the plurality of logical blocks;
storing the block maximum or minimum elements of the plurality of logical blocks in the vector register; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on a block maximum or minimum element of the plurality of logical blocks;
storing the global maximum or minimum element as one of the first K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining the next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
3. The in-memory computing device of claim 1, the computing circuitry comprising a first register, wherein, in the second ordering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to determine the top K data elements by:
storing a plurality of initial data elements from the memory array to the first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from the one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
4. The in-memory computing device of claim 3, the computing circuitry comprising a second register and a scalar arithmetic logic unit, wherein the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to determine whether to replace the target element in the first register with the candidate element by:
storing said target element and said candidate element in said second register; and
comparing, by the scalar arithmetic logic unit, the candidate element with the target element.
5. The in-memory computing device of claim 1, wherein the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to:
selecting the first ordering mode in response to a value of K in the configuration being greater than the threshold; and
selecting the second ordering mode in response to the value of K in the configuration being less than or equal to the threshold.
6. The in-memory computing device of claim 1, wherein the plurality of computing modes comprises a k-means clustering mode in which the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to:
clustering a plurality of vectors stored in the memory array by repeating the steps of:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids of the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
7. The in-memory computing device of claim 6, wherein, in the k-means clustering mode, the computing circuitry is further configured to execute the set of instructions to cause the in-memory computing device to assign each of the plurality of vectors by:
receiving the plurality of centroids from a row buffer of the memory array;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
tagging a cluster identification to the feature vector, the cluster identification indicating a closest one of the plurality of centroids to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
8. The in-memory computing device of claim 1, further comprising:
a host interface configured to enable the in-memory computing device to communicate with the host;
a configuration register configured to store the configuration from the host; and
a controller configured to send the set of instructions according to the configuration.
9. The in-memory computing device of claim 1, wherein the computing circuitry further comprises:
one or more registers configured to store data for computation;
a storage device configured to store the set of instructions; and
a decoder configured to decode the set of instructions.
10. The in-memory computing device of claim 1, wherein the computing circuitry further comprises one or more single instruction multiple data units, one or more reduction units, one or more arithmetic logic units, or any combination thereof.
11. A method of data processing, comprising:
selecting between a plurality of computing modes based on a configuration of an in-memory computing device, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
outputting, in the first ordering mode or the second ordering mode, the top K data elements of the plurality of data elements to the memory array or to a host communicatively coupled with the in-memory computing device;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
12. The data processing method of claim 11, further comprising:
in a case where the in-memory computing device is configured to operate in the first ordering mode:
receiving the plurality of data elements from a plurality of logical blocks in the memory array; and
determining the first K data elements by:
calculating and storing a block maximum or minimum element for each of the plurality of logical blocks; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements of the plurality of logical blocks;
storing the global maximum or minimum element as one of the top K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
13. The data processing method of claim 11, further comprising:
in a case where the in-memory computing device is configured to operate in the second ordering mode, determining the top K data elements by:
storing a plurality of initial data elements from the memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
14. The data processing method of claim 13, wherein determining whether to replace the target element in the first register with the candidate element comprises:
storing the target element and the candidate element in a second register; and
comparing the candidate element to the target element.
15. The data processing method of claim 11, wherein the plurality of computation patterns further includes a k-means clustering pattern, the data processing method further comprising:
in a case where the in-memory computing device is configured to operate in the k-means clustering mode, clustering a plurality of vectors stored in the memory array by repeating:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
16. A non-transitory computer-readable medium storing a set of instructions that are executable by one or more computing circuits of an apparatus to cause the apparatus to perform a data processing method, the data processing method comprising:
selecting between a plurality of computing modes based on a configuration, wherein the plurality of computing modes includes a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the device; and
outputting the first K data elements of the plurality of data elements to the memory array or to a host communicatively connected to the device in the first ordering mode or the second ordering mode;
wherein K is an integer greater than a threshold if the first sorting mode is selected, and is an integer less than or equal to the threshold if the second sorting mode is selected.
17. The non-transitory computer-readable medium of claim 16, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to further perform, in the first ordering mode:
receiving the plurality of data elements from a plurality of logic blocks in the memory array; and
determining the first K data elements by:
calculating and storing a block maximum or minimum element for each of the plurality of logical blocks; and
repeating the following operations until the first K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements of the plurality of logical blocks;
storing the global maximum or minimum element as one of the first K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining the next block maximum or minimum element of the logical block associated with the disabled global maximum or minimum element.
18. The non-transitory computer-readable medium of claim 16, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to further perform, in the second ordering mode:
determining the first K data elements by:
storing a plurality of initial data elements from a memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed by repeating:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate element from the one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate element based on a comparison of the candidate element and the target element.
19. The non-transitory computer-readable medium of claim 18, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to determine whether to replace the target element in the first register with the candidate element by:
storing the target element and the candidate element in a second register; and
comparing the candidate element to the target element.
20. The non-transitory computer-readable medium of claim 16, wherein the plurality of computing modes includes a k-means clustering mode, and the set of instructions are executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
clustering a plurality of vectors stored in the memory array by repeating the steps of:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is an average of one or more respective vectors assigned to the same cluster.
21. The non-transitory computer-readable medium of claim 20, wherein the set of instructions, when executed by the one or more computing circuits of the apparatus, cause the apparatus to allocate each of the plurality of vectors by:
receiving the plurality of centroids from a row buffer of the memory array;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
tagging a cluster identification to the feature vector, the cluster identification indicating a closest one of the plurality of centroids to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
22. A data processing system comprising:
a host; and
a plurality of in-memory computing devices communicatively coupled to the host, wherein any of the plurality of in-memory computing devices includes a memory array configured to store data and computing circuitry configured to execute a set of instructions to cause the in-memory computing device to perform the steps of:
selecting between a plurality of computing modes based on a configuration from the host, the plurality of computing modes including a first ordering mode and a second ordering mode;
accessing a plurality of data elements in a memory array of the in-memory computing device; and
outputting the first K data elements of the plurality of data elements to the host in the first ordering mode or the second ordering mode;
wherein K is an integer greater than a threshold if the first sorting mode is selected and is an integer less than or equal to the threshold if the second sorting mode is selected.
CN202080102722.1A 2020-09-07 2020-09-07 In-memory computing device and data processing method thereof Pending CN115836346A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113839 WO2022047802A1 (en) 2020-09-07 2020-09-07 Processing-in-memory device and data processing method thereof

Publications (1)

Publication Number Publication Date
CN115836346A true CN115836346A (en) 2023-03-21

Family

ID=80492204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080102722.1A Pending CN115836346A (en) 2020-09-07 2020-09-07 In-memory computing device and data processing method thereof

Country Status (2)

Country Link
CN (1) CN115836346A (en)
WO (1) WO2022047802A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9847110B2 (en) * 2014-09-03 2017-12-19 Micron Technology, Inc. Apparatuses and methods for storing a data value in multiple columns of an array corresponding to digits of a vector
US10013197B1 (en) * 2017-06-01 2018-07-03 Micron Technology, Inc. Shift skip
US11507806B2 (en) * 2017-09-08 2022-11-22 Rohit Seth Parallel neural processor for Artificial Intelligence
US11687762B2 (en) * 2018-02-27 2023-06-27 Stmicroelectronics S.R.L. Acceleration unit for a deep learning engine
US10643705B2 (en) * 2018-07-24 2020-05-05 Sandisk Technologies Llc Configurable precision neural network with differential binary non-volatile memory cell structure
KR102525165B1 (en) * 2018-12-18 2023-04-24 삼성전자주식회사 Nonvolatile memory device including arithmetic circuit and neural network system including the same

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076384A (en) * 2023-10-12 2023-11-17 清华大学无锡应用技术研究院 Computing device and in-memory computing acceleration system
CN117076384B (en) * 2023-10-12 2024-02-02 清华大学无锡应用技术研究院 Computing device and in-memory computing acceleration system
CN117474062A (en) * 2023-12-28 2024-01-30 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment
CN117474062B (en) * 2023-12-28 2024-06-04 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Also Published As

Publication number Publication date
WO2022047802A1 (en) 2022-03-10

Similar Documents

Publication Publication Date Title
US11127167B2 (en) Efficient matrix format suitable for neural networks
Albericio et al. Cnvlutin: Ineffectual-neuron-free deep neural network computing
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
US11763156B2 (en) Neural network compression based on bank-balanced sparsity
KR20220054357A (en) Method for performing PROCESSING-IN-MEMORY (PIM) operations on serially allocated data, and related memory devices and systems
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
CA2990712A1 (en) Accelerator for deep neural networks
US9632729B2 (en) Storage compute device with tiered memory processing
US20210240684A1 (en) Apparatus and method for representation of a sparse matrix in a neural network
CN114503125A (en) Structured pruning method, system and computer readable medium
US10684824B2 (en) Stochastic rounding of numerical values
US11675624B2 (en) Inference engine circuit architecture
US20220114270A1 (en) Hardware offload circuitry
WO2022047802A1 (en) Processing-in-memory device and data processing method thereof
US11921814B2 (en) Method and device for matrix multiplication optimization using vector registers
WO2021133422A1 (en) Flexible accelerator for sparse tensors (fast) in machine learning
Gupta et al. Thrifty: Training with hyperdimensional computing across flash hierarchy
CN113762493A (en) Neural network model compression method and device, acceleration unit and computing system
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
US20230385258A1 (en) Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching
Gupta et al. Store-n-learn: Classification and clustering with hyperdimensional computing across flash hierarchy
Zhan et al. Field programmable gate array‐based all‐layer accelerator with quantization neural networks for sustainable cyber‐physical systems
CN118043821A (en) Hybrid sparse compression
CN117391160A (en) Acceleration method, accelerator, and storage medium
CN116997909A (en) Sparse machine learning acceleration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination