CN117435551A - Computing device, in-memory processing storage device and operation method
- Publication number: CN117435551A
- Application number: CN202311453628.2A
- Authority: CN
- Prior art keywords: computing, computing tasks, tasks, module, sequence
- Legal status: Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7821—Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
Abstract
A computing device, an in-memory processing storage device, and an operation method are provided. The operation method includes: receiving a first sequence comprising a plurality of computing tasks, wherein the plurality of computing tasks each comprise an access operation to a storage device; comparing the access addresses of the plurality of computing tasks to the storage device to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflicts while being executed; obtaining a second sequence comprising the plurality of computing tasks based on the comparison result, wherein, compared with the first sequence, the second sequence adjusts at least the order between a first computing task and a second computing task of the plurality of computing tasks, so as to reduce or eliminate access address conflicts between the first computing task and the second computing task while they are executed; and sequentially executing the plurality of computing tasks based on the second sequence. The operation method can improve the execution efficiency of the in-memory processing storage device.
Description
Technical Field
Embodiments of the present disclosure relate to a computing device, an in-memory processing storage device, and a method of operation.
Background
In a computing system that uses a common storage device for data storage, the storage device and the computing module (e.g., a CPU) are separate modules, and data is stored in a storage device separate from the computing module. When the computing module needs to process data, the data is first read from the storage device; during computation, intermediate data may also be temporarily stored back in the storage device; finally, when the computation is complete, the result is transmitted to the storage device for storage. Such a computing architecture, in which storage and computation are separate, suffers significant data transmission loss and latency, and can become a performance bottleneck for the computing system.
To address the degradation of system performance caused by transferring data between a computing module and a storage device that are separate from each other, in-memory processing (processing in memory, PIM) storage devices, which include both a computing module and a storage module, have been studied.
In a PIM storage device, the computing function is integrated near the storage module, which effectively reduces the processing performance degradation and the processing delay caused by transferring data between the computing module and the storage module. However, because integrating a computing module near the storage module increases the chip area and power consumption of the PIM storage device, a PIM storage device generally provides only a small number of computing units, and how to improve the computing performance of the PIM storage device is one of the problems to be solved. Furthermore, when the access addresses of the storage module conflict among multiple computing tasks to be processed by the PIM storage device, the small number of computing units may make parallelization of the computing tasks difficult to achieve.
Disclosure of Invention
At least one embodiment of the present disclosure provides a method of operating an in-memory processing (PIM) storage device, the method comprising: receiving a first sequence comprising a plurality of computing tasks, wherein the plurality of computing tasks each comprise an access operation to a storage device; comparing the access addresses of the plurality of computing tasks to the storage device to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflicts while being executed; obtaining a second sequence comprising the plurality of computing tasks based on the comparison result, wherein, compared with the first sequence, the second sequence adjusts at least the order between a first computing task and a second computing task of the plurality of computing tasks, so as to reduce or eliminate access address conflicts between the first computing task and the second computing task while they are executed; and sequentially executing the plurality of computing tasks based on the second sequence.
For example, in an operation method provided in at least one embodiment of the present disclosure, the obtaining the second sequence of the plurality of computing tasks based on the comparison result includes: based on the comparison result, the ordering of the plurality of computing tasks is adjusted so that at least one other computing task is inserted between two computing tasks with conflicting access addresses, wherein the access addresses of the at least one other computing task do not conflict with the access addresses of the two computing tasks with conflicting access addresses.
For example, in an operating method provided by at least one embodiment of the present disclosure, there is no access address conflict between adjacent computing tasks in the second sequence.
For example, in an operation method provided in at least one embodiment of the present disclosure, the access addresses corresponding to any N adjacent computing tasks in the second sequence do not conflict, where N is a positive integer greater than 1 and is determined based on the number of pipeline stages required to execute a computing task of the plurality of computing tasks.
For example, in an operation method provided in at least one embodiment of the present disclosure, the plurality of computing tasks includes an atomic computing operation for executing a data operation instruction, and the plurality of computing tasks are computing tasks respectively executed by a plurality of threads.
For example, in an operation method provided in at least one embodiment of the present disclosure, the plurality of computing tasks includes a plurality of computing task groups; in the first sequence, access addresses within at least one computing task group conflict, while the corresponding access addresses between at least two computing task groups do not conflict.
For example, in an operation method provided in at least one embodiment of the present disclosure, the storage device includes a plurality of read-write ports for parallel reading and writing, where the access address conflict includes a conflict caused by different computing tasks operating the same read-write port.
For example, in an operation method provided in at least one embodiment of the present disclosure, the storage device includes a plurality of memory banks (banks), and each bank corresponds to one read-write port for parallel reading and writing.
For example, in an operation method provided by at least one embodiment of the present disclosure, the access address conflict includes a conflict caused by data dependencies.
According to at least one embodiment of the present disclosure, there is provided an in-memory processing storage device, including a storage module, a computing module, a scheduling module, and a sorting module. The storage module is configured to store data. The scheduling module is configured to receive a first sequence comprising a plurality of computing tasks and to compare the access addresses of the plurality of computing tasks to the storage module to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflicts while being executed. The sorting module is configured to obtain a second sequence including the plurality of computing tasks based on the comparison result, wherein, compared with the first sequence, the second sequence adjusts at least the order between a first computing task and a second computing task of the plurality of computing tasks so as to reduce or eliminate access address conflicts between the first computing task and the second computing task while they are executed. The computing module is configured to sequentially execute the plurality of computing tasks based on the second sequence.
For example, in the in-memory processing storage device provided in at least one embodiment of the present disclosure, the access addresses corresponding to any N adjacent computing tasks in the second sequence do not conflict, where N is a positive integer greater than 1 and is determined based on the number of pipeline stages required to execute a computing task of the plurality of computing tasks.
For example, in the in-memory processing storage device provided in at least one embodiment of the present disclosure, the storage module includes a plurality of read-write ports for parallel reading and writing, where the access address conflict includes a conflict caused by different computing tasks operating on the same read-write port.
For example, in an in-memory processing storage device provided in at least one embodiment of the present disclosure, the storage module includes a plurality of banks (banks), and each bank corresponds to a read-write port for parallel reading and writing.
For example, in the in-memory processing storage device provided in at least one embodiment of the present disclosure, the computing module is an atomic computing module configured to execute data operation instructions, and the plurality of computing tasks are computing tasks respectively executed by a plurality of threads.
According to at least one embodiment of the present disclosure, there is provided a computing device including an in-memory processing storage device provided according to at least one embodiment of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1 is a schematic diagram showing a process by which PIM storage performs a computing task without an access address conflict;
FIG. 2 shows a schematic diagram of a computing module processing a plurality of computing tasks using a 3-stage pipeline;
FIG. 3 shows a schematic diagram of performing 64 loop accumulations on a single-precision floating-point variable a while, in a quasi-parallel fashion, adding single-precision floating-point variables b[63:0] and c[63:0] to obtain d[63:0];
FIG. 4 is a schematic diagram of a method of operating a PIM storage device provided in accordance with at least one embodiment of the present disclosure;
FIG. 5 illustrates an example flow of a thread for which a scheduling module determines a bank address conflict;
FIG. 6 is a schematic diagram showing the circuit structure and processing steps of a PIM storage device provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 illustrates a schematic diagram of the processing of multiple computing tasks by a generic PIM storage device and by a PIM storage device provided in accordance with at least one embodiment of the present disclosure;
FIG. 8 illustrates another schematic diagram of the processing of multiple computing tasks by a generic PIM storage device and by a PIM storage device provided in accordance with at least one embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of a PIM storage device provided in accordance with at least one embodiment of the present disclosure;
fig. 10 illustrates a schematic block diagram of a computing device provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises" and the like means that the elements or items listed after the word, and their equivalents, are included, without excluding other elements or items. The terms "connected" or "connecting" and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may change when the absolute position of the object being described changes.
Unlike common storage devices, an in-memory processing (PIM) storage device can implement data storage, data sharing, and computing functions. A PIM storage device can generally be used for data sharing and storage among several arithmetic units.
A PIM storage device generally includes three parts: a data storage module, a computing module, and a data scheduling module.
FIG. 1 shows a schematic diagram of a process for performing computing tasks in a PIM storage device, such as a shared storage unit used in a general-purpose graphics processor (GPGPU). The shared storage unit can implement data storage, data sharing, and atomic computing functions, and is used for data sharing and storage among a plurality of arithmetic units in the GPGPU.
For example, the computing module includes an atomic computing module. It is to be understood that references to "atomic computation" in this disclosure refer to computation performed in the manner of an atomic operation, that is, the execution of atomic computation is not interrupted by, for example, the scheduling mechanism of a thread, and such operation, once started, runs all the way to the end without any context switch in between. In embodiments of the present disclosure, the computing module is not limited to being implemented as an atomic computing module, but may be a computing module that performs non-atomic operations.
When the PIM storage device executes the computing instructions of a plurality of computing tasks (e.g., a plurality of threads), as shown in FIG. 1, the data scheduling module (e.g., the address core module shown in FIG. 1) first takes in the data set InputData 1-N corresponding to the plurality of computing tasks currently input to the PIM storage device, reads the data set RamData 1-N from the data storage module (e.g., the RAM (random access memory) in FIG. 1) according to the address set InputAddress 1-N input synchronously with the data set, and then transmits the two data sets together with a valid flag (valid_flag) to the computing module to execute the computation of the corresponding instruction. For example, InputData1 and RamData1 are used as the two addends of the adder in the computing module to obtain an atomic computing result AtomResult1, and finally AtomResult1 is written back to the original position of RamData1 in the data storage module, overwriting the original RamData1; the remaining data are processed in the same way to obtain the computing results AtomResult 1-N corresponding to the plurality of computing tasks. The above describes the operation of the exemplary PIM storage device when the access addresses of the storage module by the plurality of computing tasks being processed do not conflict.
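The conflict-free read-compute-write flow just described can be summarized with the following minimal sketch (an illustration only: the C function, array sizes, and names follow the example above and are not the actual circuit implementation):

#define N_TASKS 8                     /* number of computing tasks, assumed */

static float ram[1024];               /* data storage module (RAM) */

/* For each task i: read RamData_i at InputAddress_i, add InputData_i,
 * and write the atomic result back to the same address. */
void atomic_add_tasks(const float input_data[N_TASKS],
                      const unsigned input_addr[N_TASKS])
{
    for (int i = 0; i < N_TASKS; i++) {
        float ram_data = ram[input_addr[i]];          /* read RamData_i */
        float atom_result = ram_data + input_data[i]; /* atomic add */
        ram[input_addr[i]] = atom_result;             /* write back, overwriting RamData_i */
    }
}

Because no two tasks touch the same address in this flow, the loop iterations are independent, and the hardware can process them in a pipelined, parallel fashion.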
However, when the above-mentioned plurality of computing tasks are executed with access address conflicts, the input addresses of two or more computing tasks may be equal, or the read-write ports they need to use may be the same, i.e., their target addresses in the data storage module are equal, or the high-order parts of their target addresses are the same. In this case, the data scheduling module needs to execute the computing tasks sequentially according to the sequence numbers of the input data, and can continue to schedule the remaining data for computation only after the data with the preceding sequence number has completely finished its atomic computing operation.
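In other words, two tasks conflict when they resolve to the same location or to the same read-write port. Following the text's convention that the high-order part of the target address selects the bank, the test can be sketched as follows (an illustration only; the BANK_SHIFT split is an assumption for this sketch):

#include <stdbool.h>

#define BANK_SHIFT 8   /* assumed split between bank field and in-bank offset */

static unsigned bank_of(unsigned addr) { return addr >> BANK_SHIFT; }

/* Two accesses conflict if they target the same bank, whose single
 * read-write port they would both need; equal addresses are the
 * special case of landing in the same bank. */
static bool access_conflict(unsigned addr1, unsigned addr2)
{
    return bank_of(addr1) == bank_of(addr2);
}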
If the input addresses of the computing tasks corresponding to adjacent sequence numbers are equal, the operand that the later computing task reads from the data storage module is actually the computing result of the computing task with the preceding sequence number; that is, the later computing task has a data dependency on the earlier computing task. Suppose the access address InputAddress1 of computing task 1 is equal to the access address InputAddress2 of computing task 2. The atomic computing module first executes the computing operation of computing task 1 with access address InputAddress1 and must write the computing result AtomResult1 to the position of RamData1. When executing the computing operation of computing task 2 with access address InputAddress2, RamData2 needs to be read; since InputAddress1 is equal to InputAddress2, AtomResult1 is in fact RamData2, so the computing task with access address InputAddress2 must wait for the preceding computing task with access address InputAddress1 to finish before it can read the addend it requires.
For example, access address conflicts similar to the above scenario are quite common in computer software programming. Consider the following code, which performs loop accumulation on an integer variable a:

int a = 0;
for (int i = 0; i <= 63; i++) {
    a = a + 100;
}
Because the number of accumulations is large, the width of the final result is large and a counter in the processor cannot meet the computing requirement, so atomic computing operations need to be executed to complete the computation. This requires performing multiple reads and writes to the memory at one address to complete the accumulation. In such an operation, since the data required by each addition is the computing result of the previous addition, there is a data dependency between the operations. For simple arithmetic such as integer addition, one addition can usually be completed in a digital circuit in a single clock cycle, and the arithmetic unit can add forwarding logic to improve the working efficiency of the circuit.
However, more powerful computer software often needs to process more complex data types, such as single-precision floating-point numbers, and often needs to process groups of data with data dependencies and data without data dependencies at the same time. The resources of the computing module in a processor are limited; for example, because its operation steps are complex, a floating-point addition unit in a sequential digital circuit needs several clock cycles to finish one operation, and each group of data with a data dependency can be sent into the pipeline only after the preceding operation in the pipeline has completed. The atomic computing operation therefore greatly reduces the working efficiency of the circuit.
For example, assume that 64 loop accumulations are performed on a single-precision floating-point variable a, and that single-precision floating-point variables b[63:0] and c[63:0] are added to obtain d[63:0]. Assume the floating-point addition unit has a 3-stage pipeline. Analysis shows that 64 atomic computing operations need to be performed on the variable a, and because of the data dependency between them, they cannot be pipelined on the computing module, so the loop accumulation on the variable a requires 64×3=192 clock cycles. In contrast, the 64 pairs of operands of the variables b[63:0] and c[63:0] have no data dependency, so they can be pipelined on the computing module, and the computation of the result d[63:0] requires only 64+3-1=66 clock cycles. Completing both the 64 loop accumulations on the single-precision floating-point variable a and the addition of the single-precision floating-point variables b[63:0] and c[63:0] thus requires 192+66=258 clock cycles.
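For reference, the independent addition described above corresponds to a loop of the following form (a sketch using the variable names of the example; the element type and loop bound are assumptions consistent with the text):

float b[64], c[64], d[64];

void vector_add(void)
{
    for (int i = 0; i < 64; i++) {
        d[i] = b[i] + c[i];   /* iterations are independent, so they can be pipelined */
    }
}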
FIG. 2 illustrates a schematic diagram of a computing module that uses a 3-stage pipeline to process multiple computing tasks when the computing tasks have access address conflicts. The computing module using a 3-stage pipeline means that the operation inside the computing module is divided into three stages, each using different hardware, so that 3 computing tasks at different stages can use the computing module simultaneously without collision. For example, a scalar central processing unit (CPU) may have a 5-stage pipeline: one instruction can issue per clock cycle, each instruction executes to completion in a fixed time (e.g., 5 clock cycles), and the execution of each instruction is divided into 5 steps: fetch, decode, execute, memory access, and write back.
As shown in FIG. 2, adjacent computing tasks 1 and 2 and adjacent computing tasks 5 and 6 have access address conflicts. When a 3-stage pipeline is used to process the computing tasks (i.e., each computing task itself requires 3 clock cycles (clk)), computing task 2 must wait for computing task 1 to finish in order to obtain its result and then perform its own computation; that is, computing task 2 has a data dependency on computing task 1, so computing task 1 and computing task 2 have an access address conflict. In this process, 2 clock cycles are wasted (2 clock cycles are suspended, shown as "suspend" in FIG. 2). Similarly, the execution of computing task 5 and computing task 6 wastes two clock cycles.
However, computing tasks whose access addresses do not conflict can be computed without waiting for the results of other computing tasks, so they can be scheduled onto the pipeline directly, avoiding wasted clock cycles.
Returning to the above example of performing 64 loop accumulations on the single-precision floating-point variable a and adding the single-precision floating-point variables b[63:0] and c[63:0] to obtain d[63:0]: among the 64 computing tasks involved in the loop accumulation on the variable a, all 63 tasks other than the first need the computing result of the previous task before they can execute, so any two adjacent computing tasks have an access address conflict and cannot be pipelined. In contrast, the 64 computing tasks that add the single-precision floating-point variables b[63:0] and c[63:0] to obtain d[63:0] have no access address conflicts between them, so they can be pipelined. If the two operations are performed one after the other, i.e., first the 64 loop accumulations on the variable a and then the addition of b[63:0] and c[63:0], or vice versa, then as described above 64×3+66=258 clock cycles are required. However, if the two operations are performed in a quasi-parallel manner, the overall execution time can be shortened.
FIG. 3 shows a schematic diagram of performing the 64 loop accumulations on the single-precision floating-point variable a while, in a quasi-parallel manner, adding the single-precision floating-point variables b[63:0] and c[63:0] to obtain d[63:0].
As shown in FIG. 3, the 64 computing tasks involved in adding the single-precision floating-point variables b[63:0] and c[63:0] are interspersed between the 64 computing tasks involved in the loop accumulation on the single-precision floating-point variable a, so that the total execution time is only 64×3=192 clock cycles: the idle issue slots between dependent accumulation tasks are filled by the independent addition tasks.
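These cycle counts can be sanity-checked with a small scheduling sketch (an illustration only, not the circuit: it models a single adder pipeline with a configurable number of stages that issues at most one task per clock cycle, where a dependent task may issue only after the previous dependent task has written back, and independent tasks fill the idle issue slots):

#include <stdio.h>

/* Returns the cycle at which the last of n_dep dependent tasks and
 * n_ind independent tasks completes on a pipeline with `stages` stages. */
unsigned total_cycles(unsigned n_dep, unsigned n_ind, unsigned stages)
{
    unsigned cycle = 0, done_at = 0, dep = 0, ind = 0;
    unsigned next_dep_issue = 0;  /* earliest issue cycle for the next dependent task */
    while (dep < n_dep || ind < n_ind) {
        if (dep < n_dep && cycle >= next_dep_issue) {
            dep++;
            next_dep_issue = cycle + stages;   /* must wait for write-back */
            done_at = cycle + stages;
        } else if (ind < n_ind) {
            ind++;
            done_at = cycle + stages;
        }
        cycle++;                               /* stall cycles issue nothing */
    }
    return done_at;
}

int main(void)
{
    printf("%u\n", total_cycles(64, 0, 3));   /* accumulation alone: 192 */
    printf("%u\n", total_cycles(0, 64, 3));   /* vector add alone:    66 */
    printf("%u\n", total_cycles(64, 64, 3));  /* interleaved:        192 */
    return 0;
}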
Fig. 4 is a schematic diagram of a method of operating a PIM storage device provided in accordance with at least one embodiment of the present disclosure. The PIM storage device includes, for example, a storage module and a computing module.
As shown in fig. 4, a method of operating a PIM storage device provided in accordance with at least one embodiment of the present disclosure includes the steps of:
step 401: a first sequence is received that includes a plurality of computing tasks, wherein the plurality of computing tasks each include an access operation to a storage device.
The access operation includes a read operation or a write operation to the memory device, i.e., a read operation or a write operation to the memory module.
Step 402: comparing the access addresses of the plurality of computing tasks to the storage device respectively to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflict in the process of being executed.
The access address to the memory device is the access address (memory address) in the memory module.
Step 403: and obtaining a second sequence comprising the plurality of computing tasks based on the comparison result, wherein the second sequence at least adjusts the sequence between a first computing task and a second computing task in the plurality of computing tasks compared with the first sequence so as to reduce or eliminate access address conflict of the first computing task and the second computing task in the process of being executed.
The first computing task and the second computing task here are merely the computing tasks taken as description objects, and the embodiments of the present disclosure are not particularly limited in this regard. Here, reducing an access address conflict means that, compared with the case where the conflict is not reduced, one of the first computing task and the second computing task needs to wait a shorter time for the other to finish its access to the corresponding memory address before it can access the associated memory address. Eliminating an access address conflict means that one of the first computing task and the second computing task can access the associated memory address directly, without waiting for the other to finish its access to the corresponding memory address.
Step 404: the plurality of computing tasks are sequentially performed based on the second sequence.
Thus, in the case of executing the plurality of computing tasks in the second sequence, there may be fewer or no access address conflicts during execution of the plurality of computing tasks. Thus, by reducing or eliminating access address conflicts, or executing multiple computing tasks in an order that reduces or eliminates access address conflicts, the methods provided in accordance with embodiments of the present disclosure may enable multiple computing tasks to be executed in a more efficient manner and reduce latency as much as possible.
For example, in one implementation, based on the comparison result, the ordering of the plurality of computing tasks may be adjusted such that at least one other computing task is interposed between two computing tasks having conflicting access addresses, wherein the access address corresponding to the at least one other computing task does not conflict with the access addresses of the two computing tasks having conflicting access addresses.
In this way, computing tasks whose access addresses do not conflict with two conflicting computing tasks can be interleaved between those two tasks for execution, which avoids the clock cycles wasted while the later task waits for the earlier task to finish, and enables pipelined operation of the plurality of computing tasks. This can greatly improve the processing efficiency and processing speed of the PIM storage device for computing tasks.
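One possible realization of this reordering is sketched below (an illustration only: the greedy policy, the conflict test on raw addresses, and the fallback rule are assumptions for explanation, and the patent's rule of prioritizing higher-numbered threads among conflicting tasks is not modeled). Given the first sequence of access addresses, the sketch builds a second sequence in which, where possible, each newly placed task does not conflict with the previous N-1 placed tasks; the window parameter below is therefore N-1 for an N-stage pipeline:

#include <stdbool.h>

/* addr[i]: access address of task i in the first sequence (n <= 64 assumed).
 * out[k]:  index of the task placed at position k of the second sequence. */
void reorder_tasks(const unsigned addr[], int n, int window, int out[])
{
    bool used[64] = { false };
    for (int placed = 0; placed < n; placed++) {
        int pick = -1;
        for (int i = 0; i < n; i++) {            /* scan in first-sequence order */
            if (used[i]) continue;
            bool conflict = false;
            for (int k = 1; k <= window && placed - k >= 0; k++)
                if (addr[out[placed - k]] == addr[i]) { conflict = true; break; }
            if (!conflict) { pick = i; break; }
        }
        if (pick < 0)                            /* every remaining task conflicts: */
            for (int i = 0; i < n; i++)          /* fall back to the next one       */
                if (!used[i]) { pick = i; break; }
        used[pick] = true;
        out[placed] = pick;
    }
}

For example, with addresses {5, 5, 7, 9} and window 2, the second sequence orders the tasks as 0, 2, 3, 1: the two tasks sharing address 5 are separated by two independent tasks, so a 3-stage pipeline never stalls.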
For example, in one implementation, there is no access address conflict between adjacent computing tasks in the second sequence.
For example, in one implementation, the access addresses corresponding to any N adjacent computing tasks in the second sequence do not conflict, where N is a positive integer greater than 1 and is determined based on the number of pipeline stages required to execute a computing task of the plurality of computing tasks.
For example, in one implementation, the plurality of computing tasks includes atomic computing operations for executing data operation instructions. For example, at least some of the plurality of computing tasks correspond to single precision floating point operations. In one example, the single precision floating point number operation may employ 3 stages of pipelining, and then access addresses corresponding to adjacent 3 computing tasks in the second sequence do not conflict.
For example, in one implementation, the plurality of computing tasks are computing tasks that are performed by a plurality of threads, respectively.
For example, in one implementation, the plurality of computing tasks includes a plurality of computing task groups, and in the first sequence, access addresses within at least one computing task group conflict, and corresponding access addresses between at least two computing task groups do not conflict.
For example, in one implementation, a PIM storage device according to an embodiment of the present disclosure includes a plurality of read-write ports for parallel reading and writing, where the above-mentioned access address conflict includes a conflict caused by different computing tasks operating the same read-write port.
For example, in one implementation, a PIM storage device according to an embodiment of the present disclosure includes a plurality of banks (banks). For example, in this case, the access address of the computing task to the storage device includes a bank address.
For example, in one implementation, each bank corresponds to a read-write port for parallel reading and writing. For example, the access of the computing task to the memory bank includes an access of the computing task to a read-write port. For the same memory bank, if one computing task does not end the access to the read-write port of the memory bank, another computing task that also needs to access the read-write port of the memory bank needs to wait for the access on the read-write port to end before accessing the read-write port.
In one implementation, the access address conflict includes a conflict based on data dependencies. For example, among the multiple computing tasks involved in performing loop accumulation on variables, the execution of a computing task depends on the computation result of the previous computing task, and thus there is a data dependency between adjacent computing tasks, which may result in an access address conflict.
It is to be appreciated that one or more embodiments of the present disclosure are not limited to access address conflicts based on data dependencies, but may be applicable to any situation where access addresses conflict.
For convenience of description, the following takes as an example the case where each computing task is a single-precision floating-point addition operation executed by a thread, the single-precision floating-point addition employs a 3-stage pipeline, and the access address conflict is a memory bank conflict (bank conflict). It should be understood that this description is for convenience and ease of understanding only, and is not intended to limit the scope of the present disclosure to threads performing single-precision floating-point addition, to 3-stage pipelines, or to access address conflicts that are bank conflicts. The scope of the present disclosure should be determined by the claims or the technical spirit described throughout the specification.
As described above, the PIM class storage circuit includes a storage module, a calculation module (e.g., an atomic calculation module), and a scheduling module. The storage module is used for storing data, and can be divided into a plurality of storage banks (banks) and channels (lanes) to meet the requirement of parallel reading and writing. The scheduling module is used for calculating the data read-write address and scheduling the data according to the data dependency. The computing module is used for executing instructions of data operation, and 1 or more computing units can be contained in the computing module according to the complexity of different instructions. The computing module is the biggest difference between the PIM storage unit and the common storage unit, and the PIM storage circuit can directly integrate the computing function near the storage unit, so that the loss and delay caused by data transmission can be reduced, the parallelization of data processing can be realized by utilizing the characteristics of the storage module, and the PIM storage circuit is commonly used for chips such as a general-purpose graphics processor (GPGPU).
For example, adding a computing module near the storage module increases area and power consumption, and also increases the complexity of data scheduling, so the computing module is typically kept small, e.g., by providing only a small number of atomic computing units. The atomic computing unit supports operations such as integer addition and subtraction, single-precision floating-point comparison, and single-precision floating-point addition. Among these, single-precision floating-point addition has complex operation steps and the corresponding arithmetic unit has a large area; to reduce area, the computing module includes only one floating-point addition unit, so parallelization of this operation cannot be achieved.
Taking a data share (hereinafter referred to as DS) unit integrated in a GPGPU as an example, the data share unit is a PIM-class storage unit in which the scheduling module needs to handle the multi-thread operation of the GPGPU and schedule the addresses and data corresponding to multiple threads for parallel reading and writing. The scheduling module needs to determine whether the bank addresses of 2 or more threads are equal; if they are equal, the accesses must be arranged according to the thread numbers. For example, if the bank addresses of threads t0 and t1 are both 0, the 2 threads both need to read and write bank 0; since each bank of the storage module has only one read-write port, the read-write operation of thread t1 is executed first, and the read-write operation of thread t0 is executed in the next read-write cycle, similar to an atomic operation with a data dependency.
Combining the principles of atomic operation and data dependency, it can be seen that when the DS unit encounters a bank conflict, the threads that generate the bank conflict must wait for the read-write operations of certain other threads to complete before they can start execution, which reduces the working efficiency of the circuit. Moreover, because there is only one floating-point addition unit, threads without bank conflicts cannot be dispatched to spare arithmetic units for parallel operation, so the working efficiency of the circuit is severely reduced. Embodiments of the present disclosure can therefore improve the efficiency with which the storage circuit performs complex computing tasks (e.g., single-precision floating-point addition) in a PIM-class storage unit without increasing register resources.
In the PIM-class storage circuit, the scheduling module itself at least partly includes digital logic for screening out threads with equal bank addresses. This digital logic compares the bank addresses of all threads with one another and picks out the threads with bank conflicts, so that the floating-point addition on the data of the higher-numbered threads is executed first.
FIG. 5 illustrates an example flow in which the scheduling module determines the threads with bank address conflicts. As shown in FIG. 5, the scheduling module may compare the bank addresses corresponding to the operations of any two of the plurality of threads to obtain a comparison result indicating whether those bank addresses are equal. Based on the comparison results, the scheduling module may obtain a first thread set in which bank address conflicts occur. Because there may be data dependencies between threads that have a bank address conflict, to ensure that executing the threads produces correct output, the scheduling module may execute the operations of the plurality of threads such that the higher-numbered threads in the first thread set execute first.
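The pairwise comparison can be sketched as follows (an illustration only; the actual digital logic performs all the comparisons in parallel rather than in a software loop):

#include <stdbool.h>

/* Marks every thread whose bank address equals that of at least one
 * other thread; the marked threads form the "first thread set" above. */
void find_bank_conflicts(const unsigned bank_addr[], int n, bool conflicted[])
{
    for (int i = 0; i < n; i++)
        conflicted[i] = false;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (bank_addr[i] == bank_addr[j])
                conflicted[i] = conflicted[j] = true;
}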
As previously described, data dependencies among multiple computing tasks with conflicting access addresses cause the execution of a computing task to wait for the preceding computing task to finish, making the overall execution of the storage device inefficient. If the floating-point addition unit adopts a 3-stage pipeline and a bank conflict exists between two adjacent threads, the remaining independent threads can be pipelined only after the higher-numbered conflicting thread has been executed, which results in low execution efficiency.
According to the storage device provided by embodiments of the present disclosure, the comparison results obtained by the scheduling module can be used to reorder the operations of the plurality of threads, thereby reducing or eliminating conflicts in the access addresses of the storage module when the plurality of threads are executed. In at least one example, only a small amount of digital logic needs to be added: by reusing the bank-address comparison portion of the scheduling module, independent threads whose bank addresses are mutually unequal can be extracted and the thread execution order reordered, instead of executing strictly in thread-number order. This improves the efficiency of executing multiple threads among which bank address conflicts exist (e.g., a thread group including both threads with bank address conflicts and threads without) at the cost of only a little extra digital logic. Because only address information needs to be processed, no additional registers are needed to store thread data; the added digital logic is small, yet the execution efficiency of the storage circuit can be greatly improved and the waiting time greatly reduced.
Fig. 6 is a schematic diagram illustrating a circuit configuration and processing steps of a PIM storage device provided in accordance with at least one embodiment of the present disclosure.
As shown in FIG. 6, the scheduling module included in a PIM storage device according to at least one embodiment of the present disclosure may further include, in addition to an address core module, reordering logic for reordering the input address set InputAddress 1-N corresponding to a plurality of computing tasks (e.g., a plurality of threads). The scheduling module inputs the reordered addresses to the storage module (the RAM shown in FIG. 6). Correspondingly, the computing module (e.g., an atomic computing module) computes the results AtomResult 1-N based on the input data sets InputData 1-N corresponding to the plurality of computing tasks, the data sets RamData 1-N read from the RAM according to the reordered InputAddress 1-N, and the valid flag signals from the scheduling module.
It will be appreciated that although the scheduling module shown in fig. 6 includes reordering logic, this is merely an example implementation. The PIM storage provided in accordance with at least one embodiment of the present disclosure may also include the scheduling module and the reordering logic as separate components (i.e., the reordering logic is included in the PIM storage as separate components rather than in the scheduling module).
FIG. 7 illustrates a schematic diagram of the processing of multiple computing tasks by a generic PIM storage device and by a PIM storage device provided in accordance with at least one embodiment of the present disclosure.
As shown in FIG. 7, assume that the computing task corresponding to each thread needs 3 clock cycles (clk) to complete, and that the storage device needs to sequentially process the threads numbered 0 to 8, whose access addresses are, for example, 1, 3, 3, 2, 5, 8, 8, 4 and 7, respectively; that is, adjacent threads 1 and 2 have an access address conflict, and adjacent threads 5 and 6 have an access address conflict.
The processing of the multiple computing tasks by a common storage device is shown in dashed box 701, and the processing by the PIM storage device provided in accordance with at least one embodiment of the present disclosure is shown in dashed box 702. Because threads t3 and t4, which have no access address conflict, are inserted between the pipelined executions of the conflicting threads t1 and t2, the time that would otherwise be spent waiting is used to process those threads, so the total processing time is shortened and the processing efficiency improved.
FIG. 8 illustrates another schematic diagram of the processing of multiple computing tasks by a generic PIM storage device and by a PIM storage device provided in accordance with at least one embodiment of the present disclosure.
In the example shown in FIG. 8, the plurality of computing tasks includes 3 thread groups (group 1: threads 0 and 1, group 2: threads 2 and 3, group 3: threads 4 and 5) with conflicting access addresses and threads 6, 7, 8 with non-conflicting access addresses. The access addresses between the 3 thread groups do not conflict. When the computing tasks corresponding to the threads are put into the 3-stage pipeline of the floating point arithmetic unit, the processing of the plurality of computing tasks by the common PIM storage device and the PIM storage device provided according to at least one embodiment of the present disclosure are shown in dashed boxes 801 and 802, respectively.
As shown by 801 and 802 in FIG. 8, by inserting thread t2 of group 2, thread t4 of group 3, and so on between the conflicting threads t0 and t1 of group 1 (as indicated by dashed box 802), the 3 thread groups, each of which has an internal access address conflict, can all run through the complete 3-stage pipeline without waiting. A PIM storage device provided in accordance with at least one embodiment of the present disclosure can therefore operate more efficiently than a conventional PIM storage device.
Fig. 9 illustrates a schematic diagram of a PIM storage device 900 provided in accordance with at least one embodiment of the present disclosure.
As shown in fig. 9, a PIM storage device 900 provided according to at least one embodiment of the present disclosure includes a storage module 901, a calculation module 902, a scheduling module 903, and a ranking module 904.
The storage module 901 is configured to store data. The memory module 901 may be, for example, a semiconductor memory module, for example, SRAM, DRAM, a register array, or the like, and may be embodied, for example, as RAM in the example shown in fig. 6.
The scheduling module 903 is configured to receive a first sequence including a plurality of computing tasks, and compare access addresses of the plurality of computing tasks to the storage module respectively to obtain a comparison result, where the comparison result indicates whether the plurality of computing tasks have access address conflicts in the process of being executed. The scheduling module 903 may be implemented, for example, by digital logic circuitry, and may include, for example, comparison logic, etc. For example, the dispatch module 903 may be embodied as the dispatch module shown in FIG. 6, or may be embodied as an address core in the dispatch module shown in FIG. 6.
The ranking module 904 is configured to obtain a second sequence including the plurality of computing tasks based on the comparison result, wherein the second sequence adjusts at least an order between a first computing task and a second computing task of the plurality of computing tasks compared to the first sequence to reduce or eliminate access address conflicts of the first computing task and the second computing task during execution. The sorting module 904 may be implemented, for example, by digital logic circuitry, and may include sorting logic, for example. For example, the ordering module may be embodied as the scheduling module shown in FIG. 6, or may be embodied as the reordering logic in the scheduling module shown in FIG. 6.
The computing module 902 is configured to sequentially perform the plurality of computing tasks based on the second sequence. The calculation module 902 may be, for example, an atomic calculation module, and may implement, for example, integer addition and subtraction, single-precision floating point size comparison, single-precision floating point addition, and other operation operations. For example, the computing module 902 may be embodied as an atomic computing module as shown in FIG. 6.
For example, in one implementation, the access addresses corresponding to any N adjacent computing tasks in the second sequence do not conflict, where N is a positive integer greater than 1 and is determined based on the number of pipeline stages required to execute a computing task of the plurality of computing tasks. For example, if the plurality of computing tasks correspond to single-precision floating-point addition operations, N may be 3, or N may be 4, depending on the number of pipeline stages set for the single-precision floating-point operation.
In a PIM storage device provided in accordance with at least one embodiment of the present disclosure, the storage module includes a plurality of read and write ports for reading and writing in parallel.
For example, in one implementation, the access address conflict includes a conflict caused by different computing tasks operating on the same read-write port.
In a PIM storage device provided in accordance with at least one embodiment of the present disclosure, the storage module includes a plurality of banks (banks), for example, in which case the access address of the computing task to the storage device includes a bank address.
For example, in one implementation, each bank corresponds to a read-write port for parallel reading and writing. For example, the access of the computing task to the memory bank includes an access of the computing task to a read-write port. For the same memory bank, if one computing task does not end the access to the read-write port of the memory bank, another computing task that also needs to access the read-write port of the memory bank needs to wait for the access on the read-write port to end before accessing the read-write port.
In a PIM storage device provided in accordance with at least one embodiment of the present disclosure, the computing module is an atomic computing module configured to perform data operation instructions, such as single precision floating point operations, multi-precision floating point operations, and the like.
For example, in one implementation, the plurality of computing tasks are computing tasks that are performed by a plurality of threads, respectively.
At least some embodiments of the present disclosure also provide a computing device comprising a PIM store of any one of the embodiments described above. Fig. 10 is a schematic block diagram of a computing device provided in accordance with at least one embodiment of the present disclosure.
For example, in some implementations, the computing device 1000 may be a General Purpose Graphics Processor (GPGPU) or an electronic device that includes the GPGPU, or the like.
A computing device provided in accordance with at least one embodiment of the present disclosure includes a PIM storage device provided in accordance with at least one embodiment of the present disclosure, and thus the computing device may implement any function or operation that the PIM storage device provided in accordance with at least one embodiment of the present disclosure may implement.
Computing devices in embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, as well as stationary terminals such as digital TVs, desktop computers, and the like. The computing device 1000 illustrated in fig. 10 is merely an example and should not be taken as limiting the functionality and scope of use of embodiments of the present disclosure.
For example, as shown in FIG. 10, in some examples the computing device 1000 includes a processing device (e.g., a central processor, a graphics processor, etc.) 1001, and the processing device may include a PIM storage device of any of the embodiments described above. The RAM 1003 also stores various programs and data necessary for the operation of the computing device. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to one another via a bus 1004. For example, in an implementation in which the processing device includes the PIM storage device of any of the embodiments described above, the computing device 1000 may omit at least one of the ROM 1002, the RAM 1003, and the storage device 1008 in FIG. 10. An input/output (I/O) interface 1005 is also connected to the bus 1004.
For example, the following components may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 1007 including a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, a magnetic tape, a hard disk, and the like; and communication devices 1009, which may include, for example, a network interface card such as a LAN card or a modem. The communication device 1009 may allow the computing device 1000 to communicate wirelessly or by wire with other apparatuses to exchange data, performing communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage device 1008 as needed. While FIG. 10 illustrates a computing device 1000 including various devices, it should be understood that not all of the illustrated devices are required to be implemented or included; more or fewer devices may alternatively be implemented or included.
For example, the computing device 1000 may further include a peripheral interface (not shown), and the like. The peripheral interface may be any of various types of interfaces, such as a USB interface or a Lightning interface. The communication device 1009 may communicate, via wireless communication, with networks, such as the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN), and with other devices. The wireless communication may use any of a variety of communication standards, protocols, and technologies, including, but not limited to, Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), Wi-MAX, protocols for e-mail, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
For example, the computing device 1000 may be any device such as a mobile phone, a tablet computer, a notebook computer, an e-book reader, a game console, a television, a digital photo frame, a navigator, or a server, or any combination of a data processing device and hardware; embodiments of the present disclosure are not limited in this regard.
It should be noted that, for clarity and brevity, not all of the constituent elements of the computing device or the PIM storage device are presented in the embodiments of the present disclosure. To realize the necessary functions of the computing device or the PIM storage device, those skilled in the art may provide or arrange other constituent elements not shown according to specific needs; embodiments of the present disclosure are not limited in this regard.
For the technical effects of the computing device or the PIM storage device in the different embodiments, reference may be made to the technical effects of the operation method of the PIM storage device provided in the embodiments of the present disclosure, which are not repeated here.
The following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; for other structures, reference may be made to common designs.
(2) In case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The foregoing is merely a description of exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure, which is defined by the appended claims.
Claims (15)
1. A method of operating an in-memory processing storage device, comprising:
receiving a first sequence comprising a plurality of computing tasks, wherein the plurality of computing tasks each comprise an access operation to the storage device;
comparing access addresses of the storage device respectively corresponding to the plurality of computing tasks to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflicts in the process of being executed;
obtaining a second sequence comprising the plurality of computing tasks based on the comparison result, wherein, compared with the first sequence, the second sequence adjusts at least an order between a first computing task and a second computing task of the plurality of computing tasks, so as to reduce or eliminate access address conflicts between the first computing task and the second computing task in the process of being executed; and
sequentially performing the plurality of computing tasks based on the second sequence.
2. The operation method of claim 1, wherein the obtaining a second sequence comprising the plurality of computing tasks based on the comparison result comprises:
adjusting, based on the comparison result, the ordering of the plurality of computing tasks so that at least one other computing task is inserted between two computing tasks whose access addresses conflict, wherein the access address corresponding to the at least one other computing task does not conflict with the access addresses of the two computing tasks whose access addresses conflict.
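By way of illustration only (not part of the claimed method), the reordering described in claims 1 and 2 can be sketched in Python as follows. The `Task` type, the greedy selection strategy, and all names are assumptions chosen for exposition; the patent does not specify an implementation.

```python
# Minimal sketch, assuming each computing task carries a single access address.
from dataclasses import dataclass

@dataclass
class Task:
    thread_id: int
    address: int  # access address into the PIM storage module

def conflicts(a: Task, b: Task) -> bool:
    # Two tasks conflict when they target the same access address.
    return a.address == b.address

def reorder(first_sequence: list[Task]) -> list[Task]:
    """Greedily build the second sequence: never place a task directly
    after one it conflicts with if a non-conflicting task is available."""
    pending = list(first_sequence)
    second: list[Task] = []
    while pending:
        # Prefer the earliest pending task that does not conflict with the
        # task just emitted (otherwise the original order is preserved).
        pick = next(
            (t for t in pending if not second or not conflicts(second[-1], t)),
            pending[0],  # fall back when a conflict is unavoidable
        )
        pending.remove(pick)
        second.append(pick)
    return second

# Example: tasks 0 and 1 hit the same address; task 2 is inserted between them.
tasks = [Task(0, 0x100), Task(1, 0x100), Task(2, 0x200)]
print([t.thread_id for t in reorder(tasks)])  # -> [0, 2, 1]
```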
3. The operation method of claim 1, wherein no access address conflict exists between adjacent computing tasks in the second sequence.
4. The operation method of any one of claims 1-3, wherein access addresses corresponding to N adjacent computing tasks in the second sequence do not conflict, N being a positive integer greater than 1 that is determined based on the number of stages of a pipeline required for executing a computing task of the plurality of computing tasks.
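An illustrative sketch of the window constraint in claim 4, reusing the hypothetical `Task` type from the sketch after claim 2. Setting N to the pipeline depth comes from the claim itself; everything else is assumed.

```python
# Sketch: any N adjacent tasks in the second sequence must be mutually
# conflict-free, where N is the pipeline stage count. With, say, a 4-stage
# read-modify-write pipeline, n=4 ensures a write completes before the same
# address is touched again.
def window_conflict_free(second: list[Task], n: int) -> bool:
    for i in range(len(second)):
        window = second[i : i + n]
        addresses = [t.address for t in window]
        # Duplicate addresses within the window mean a conflict.
        if len(addresses) != len(set(addresses)):
            return False
    return True
```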
5. The operation method of claim 4, wherein the plurality of computing tasks comprise atomic computing operations for executing data operation instructions, and the plurality of computing tasks are computing tasks executed respectively by a plurality of threads.
6. The operation method of any one of claims 1-3, wherein the plurality of computing tasks comprise a plurality of computing task groups,
wherein, in the first sequence, access addresses conflict within at least one computing task group, and corresponding access addresses between at least two computing task groups do not conflict.
7. The operation method of any one of claims 1-3, wherein the storage device comprises a plurality of read-write ports for parallel reading and writing, and
the access address conflict comprises a conflict caused by different computing tasks operating the same read-write port.
8. The operation method of claim 7, wherein the storage device comprises a plurality of memory banks, and each memory bank corresponds to a read-write port for parallel reading and writing.
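Claims 7 and 8 tie conflicts to read-write ports and memory banks. The sketch below, again reusing the hypothetical `Task` type, illustrates one common address-to-bank mapping (low-order modulo); this mapping and the bank count are assumptions, not taken from the patent.

```python
NUM_BANKS = 8  # hypothetical bank count

def bank_of(address: int) -> int:
    # Map an access address to a bank; each bank owns one read-write port.
    return address % NUM_BANKS

def port_conflict(a: Task, b: Task) -> bool:
    # Tasks contend for the same port when their addresses fall in the
    # same bank, even if the addresses themselves differ.
    return bank_of(a.address) == bank_of(b.address)
```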
9. The operation method of any one of claims 1-3, wherein the access address conflict comprises a conflict based on a data dependency.
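Claim 9 broadens the conflict notion to data dependencies. A hedged sketch of how such a dependency test might look, with invented read/write-set fields that the patent does not describe:

```python
from dataclasses import dataclass, field

@dataclass
class RWTask:
    thread_id: int
    reads: set[int] = field(default_factory=set)
    writes: set[int] = field(default_factory=set)

def data_dependent(earlier: RWTask, later: RWTask) -> bool:
    # RAW, WAR, and WAW hazards are all treated as access address conflicts.
    return bool(
        (earlier.writes & later.reads)      # read-after-write
        or (earlier.reads & later.writes)   # write-after-read
        or (earlier.writes & later.writes)  # write-after-write
    )
```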
10. An in-memory processing storage device, comprising a storage module, a computing module, a scheduling module, and an ordering module, wherein:
the storage module is configured to store data;
the scheduling module is configured to receive a first sequence comprising a plurality of computing tasks and to compare access addresses of the storage module respectively corresponding to the plurality of computing tasks to obtain a comparison result, wherein the comparison result indicates whether the plurality of computing tasks have access address conflicts in the process of being executed;
the ordering module is configured to obtain a second sequence comprising the plurality of computing tasks based on the comparison result, wherein, compared with the first sequence, the second sequence adjusts at least an order between a first computing task and a second computing task of the plurality of computing tasks to reduce or eliminate access address conflicts between the first computing task and the second computing task in the process of being executed; and
the computing module is configured to sequentially perform the plurality of computing tasks based on the second sequence.
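For illustration only, the cooperation of the four claimed modules might be organized as below, reusing `Task`, `conflicts`, and `reorder` from the sketch after claim 2. The class, its method names, and the atomic-add payload are all assumptions for exposition, not the claimed device.

```python
class PIMStorageDevice:
    def __init__(self, pipeline_stages: int):
        self.memory: dict[int, int] = {}  # storage module: address -> value
        self.n = pipeline_stages          # pipeline depth (see claim 11)

    def schedule(self, first: list[Task]) -> list[list[bool]]:
        # Scheduling module: pairwise comparison of access addresses.
        return [[conflicts(a, b) for b in first] for a in first]

    def order(self, first: list[Task]) -> list[Task]:
        # Ordering module: derive the second sequence from the comparison.
        return reorder(first)

    def compute(self, second: list[Task]) -> None:
        # Computing module: execute tasks in the adjusted order; here each
        # task is a stand-in atomic add of 1 at its access address.
        for t in second:
            self.memory[t.address] = self.memory.get(t.address, 0) + 1
```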
11. The in-memory processing storage device of claim 10, wherein access addresses corresponding to N adjacent computing tasks in the second sequence do not conflict, N being a positive integer greater than 1 that is determined based on the number of stages of a pipeline required for executing a computing task of the plurality of computing tasks.
12. The in-memory processing storage device of claim 10, wherein the storage module comprises a plurality of read-write ports for parallel reading and writing, and
the access address conflict comprises a conflict caused by different computing tasks operating the same read-write port.
13. The in-memory processing storage device of claim 12, wherein the storage module comprises a plurality of memory banks, and each memory bank corresponds to a read-write port for parallel reading and writing.
14. The in-memory processing storage device of any one of claims 10-13, wherein the computing module is an atomic computing module configured to execute data operation instructions, and the plurality of computing tasks are computing tasks executed respectively by a plurality of threads.
15. A computing device comprising the in-memory processing storage device of any one of claims 10-14.