CN117667039A

CN117667039A - In-memory computing MIMD Boolean logic compiler based on bit serial

Info

Publication number: CN117667039A
Application number: CN202311552949.8A
Authority: CN
Inventors: 汤晨宇; 何哲陟; 聂晨
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2023-11-21
Filing date: 2023-11-21
Publication date: 2024-03-08

Abstract

The invention discloses a bit-serial in-memory computing MIMD Boolean logic compiler comprising a compiler and a simulator. The compiler, a MIMD instruction compiler for bit-serial in-memory computing, processes a Verilog file line by line to generate an intermediate representation that describes the logic netlist as a directed acyclic graph; abstracts the computing resources as Blocks; partitions the computing resources according to the number of Sub-arrays, with a flexibly adjustable number of instruction streams; assigns a mapping scheme to each partitioning scheme; and estimates the scheduling overhead. The simulator verifies the correctness of the compiler by generating test data, parsing instructions, performing function-level simulation, and verifying the results. The compiler compiles simple logic computing tasks into Boolean operation instructions executable by bit-serial in-memory computing, achieving more efficient parallel computing. A basic model and a module partitioning scheme for MIMD task scheduling on bit-serial in-memory computing systems are also formulated, making task scheduling more accurate and efficient.

Description

In-memory computing MIMD Boolean logic compiler based on bit serial

Technical Field

The invention relates to the technical field of computer architecture, and in particular to a bit-serial in-memory computing MIMD Boolean logic compiler.

Background

In existing bit-serial in-memory computing systems, Boolean logic computing tasks are usually executed in SIMD (single instruction, multiple data) fashion. This approach works well on large-scale data, but with smaller data volumes the computing resources are often underutilized. The SIMD computing mode also means that existing bit-serial in-memory computing work usually schedules computing tasks with simple topological ordering or greedy strategies, leaving very limited room to optimize the schedule. In contrast, the MIMD (multiple instruction, multiple data) computing mode greatly expands the optimization space of the scheduling scheme, but places higher demands on scheduling quality.

Furthermore, although conventional multi-core scheduling has been studied extensively, MIMD computing based on in-memory computing conflicts with conventional multi-core computing on many basic assumptions. For example, conventional multi-core scheduling algorithms often assume that communication between computing units is asynchronous: an asynchronous transmission mechanism lets each computing unit focus on the computing tasks it is executing or about to execute, without spending extra time on data communication. The internal capacity of each computing unit is also often assumed to be infinite, meaning that the intermediate results of a computing unit always remain in that unit. In reality, the local storage capacity of any computing unit is limited; for in-memory computing in particular, which is itself a memory-based computing system, capacity is one of the important factors that must be considered. Therefore, although conventional multi-core scheduling schemes offer some reference value, they cannot be applied directly to a MIMD computing system based on bit-serial in-memory computing.

It is therefore necessary to solve the problem of how to implement efficient MIMD Boolean logic task scheduling in a bit-serial in-memory computing system, so as to fully utilize computing resources and improve scheduling quality while accounting for both the limited capacity of the computing units and the synchronous communication mechanism.

Disclosure of Invention

In view of this, the present invention proposes a bit-serial in-memory computing MIMD Boolean logic compiler to solve the problems in the prior art and to improve the utilization and scheduling quality of in-memory computing systems.

The specific technical scheme of the invention is as follows:

A bit-serial in-memory computing MIMD Boolean logic compiler comprises a module 1 and a module 2. Module 1 is a MIMD instruction compiler for bit-serial in-memory computing: it processes a Verilog file line by line to generate an intermediate representation, describes the logic netlist as a directed acyclic graph, abstracts the computing resources as Blocks, partitions the computing resources according to the number of Sub-arrays with a flexibly adjustable number of instruction streams, assigns a mapping scheme to each partitioning scheme, and estimates the scheduling overhead. Module 2 is a simulator for verifying the correctness of the compiler: it generates test data, parses instructions, performs function-level simulation, and verifies the results.

Specifically, module 1 includes a Verilog file parsing tool, which takes as input a combinational-logic Verilog file meeting specific format requirements and analyzes it line by line to generate an intermediate representation suitable for subsequent processing.

In particular, module 1 includes a directed acyclic graph describing the same logical netlist as the input file.

Specifically, module 1 includes a computing resource abstraction model that abstracts concrete computing hardware resources into computing units that can be used directly by a compiler.

Specifically, module 1 includes a partitioning-and-mapping sub-module for computing resources, which determines the number of computing units (Blocks) the compiler can control according to the number of Sub-arrays in the hardware.

Specifically, the module 1 includes a task scheduling algorithm, where the task scheduling algorithm is configured to assign a mapping scheme to each partition scheme and estimate scheduling overhead on the premise of meeting constraints of in-memory computing capacity and synchronous communication mechanisms.

Specifically, module 1 includes a linear programming system that makes a scheduling plan for every possible partitioning and mapping scheme and estimates the cost required to traverse the entire computational graph under that plan, so that the user completes the computing task at hand in as short a time as possible.

Specifically, module 1 includes an instruction generation system that uses a defined instruction system to describe equivalently the final instructions produced by the computing scheme; these instructions can be used by the simulator to verify results and, in future work, translated into equivalent binary code for actual execution by hardware.

Specifically, module 2 includes: a test data generation module, which randomly generates a specified quantity of input data of a specified bit width according to the tester's needs; an instruction parsing module, which directly reads the logic operation instructions generated by module 1 and stores them in memory; and an instruction function-level simulation module, which loads the generated test data, executes the logic computing instructions strictly in sequence, and, after all computation is complete, outputs the simulation results to an external file as needed.

Specifically, module 2 further includes a result verification tool, which uses another simulation tool, with the Verilog file of module 1 as input, to simulate the same test data, and compares that reference result with the simulation result item by item. If the two files are identical, the correctness verification passes and the logic operation instructions generated by the compiler are shown to be correct; if the two files differ, the cause of the difference must be analyzed further and the compiler corrected and optimized accordingly.

The invention has the beneficial effects that:

(1) Defining the execution mode of the in-memory computing MIMD instruction, wherein the mode enables a compiler to compile a simple logic computing task into a Boolean operation instruction which can be executed based on bit serial in-memory computing;

(2) A compiler is designed that compiles simple logic computing tasks into MIMD Boolean operation instructions executable by bit-serial in-memory computing;

(3) The compiler can parse and translate Verilog combinational logic modules in a specific format, which enables us to translate some commonly used combinational logic modules into boolean operation instructions that can be executed based on bit serial in-memory computations;

(4) The compiler can group and map the in-memory computing units, which enables us to assign different computing tasks to different computing units, thereby achieving parallel computing;

(5) The compiler can analyze specific computing tasks, which enables us to know the complexity and required resources of each computing task, thereby better scheduling;

(6) From the hardware information of the computing units and the computing tasks, the compiler can determine a scheduling scheme, which enables us to select the most appropriate computing unit for each computing task, thereby achieving optimal scheduling;

(7) A scheduling scheme based on a critical path is designed, and is suitable for a synchronous communication mechanism, and the limited capacity of a computing unit is considered, so that computing resources can be better utilized, and the computing speed is improved;

(8) A simulator is designed to verify the correctness of the instruction computation results, which enables us to verify the correctness and reliability of the compiler.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required for the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a basic workflow diagram of a compiler of the present application;

FIG. 2 is a schematic diagram of a memory architecture of a bit-serial based in-memory computing unit according to the present application;

FIG. 3 is a schematic diagram of a basic calculation mode inside the in-memory calculation of the present application;

FIG. 4 is a schematic diagram of different partitioning schemes of the present application;

FIG. 5 is a hierarchical schematic diagram of a complete computing scheme of the present application;

FIG. 6 is a schematic view of node scheduling of the computational graph of the present application;

FIG. 7 is an effect evaluation chart of the present application.

Detailed Description

In order to make the technical problems, technical schemes and beneficial effects to be solved more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The application provides a bit-serial in-memory computing MIMD Boolean logic compiler comprising module 1, a MIMD instruction compiler for bit-serial in-memory computing, and module 2, a simulator for verifying the correctness of the compiler. FIG. 1 shows the macro-level architecture and operating logic of the application; items (2) and (3) in the figure are the main contributions of this patent. (2) The Compiler is a program that converts high-level programming language code into machine code or low-level code: it receives source code as input and converts it into executable object code. (3) The Simulator is a program or device that simulates a real environment or system. In addition, the configurator (Config) is responsible for managing hardware resources such as processors and other components; the specification generator (PIM Specification Generator) uses the information held by the configurator to create the appropriate specification; and the instruction stream system (Instruction Stream) converts the specification into machine code and sends it to an executor or application executor. These components cooperate within the PIM (processing-in-memory) system to achieve efficient and reliable computing.

Module 1, the bit-serial in-memory computing MIMD instruction compiler, comprises the following sub-modules:

Sub-module 1.1: Verilog file parsing tool. This part takes as input a combinational-logic Verilog file meeting specific format requirements, analyzes it line by line, and generates an intermediate representation suitable for subsequent processing.
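The patent does not spell out the "specific format requirements". As a minimal sketch, assume a hypothetical restricted format in which every gate is a single-line `assign` with one operator (`&`, `|`, `^`, or `~`); the function name `parse_verilog` and the tuple-based intermediate representation are illustrative assumptions, not the patent's actual design.

```python
import re

# Assumed format (not from the patent): one two-input gate or one inverter per line.
BINARY = re.compile(r"^\s*assign\s+(\w+)\s*=\s*(\w+)\s*([&|^])\s*(\w+)\s*;\s*$")
UNARY = re.compile(r"^\s*assign\s+(\w+)\s*=\s*~\s*(\w+)\s*;\s*$")
OPS = {"&": "AND", "|": "OR", "^": "XOR"}

def parse_verilog(text):
    """Scan the file line by line; return (output, op, operands) tuples."""
    gates = []
    for line in text.splitlines():
        m = BINARY.match(line)
        if m:
            out, a, op, b = m.groups()
            gates.append((out, OPS[op], (a, b)))
            continue
        m = UNARY.match(line)
        if m:
            out, a = m.groups()
            gates.append((out, "NOT", (a,)))
        # Declarations (module, input, output, wire, endmodule) are skipped in this sketch.
    return gates

SAMPLE = """\
module half_adder(a, b, s, c);
  input a, b;
  output s, c;
  assign s = a ^ b;
  assign c = a & b;
endmodule
"""

ir = parse_verilog(SAMPLE)
```

Here `ir` holds one tuple per gate, in source order, which is enough to build the graph described in the next sub-module.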

Sub-module 1.2: intermediate representation as a directed acyclic graph. This part describes the same logic netlist as the input file. The netlist is essentially a directed acyclic graph (DAG) whose most important elements are its nodes and the edges connecting them. A node represents a logic operation; an edge pointing into a node comes from a source node that supplies data the operation needs; and an edge leaving a node points to another node that depends on the result the node produces. The weight of a node represents the cost of its computation, and the weight of an edge represents the cost of the data transfer incurred when the two nodes at its ends are not in the same computing unit.
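A weighted DAG of this kind could be held in a structure like the following. This is a sketch only: the class name `Dag`, its fields, and the unit costs are assumptions chosen for illustration, and the example netlist is an ordinary full adder rather than anything taken from the patent.

```python
from collections import defaultdict

class Dag:
    """Logic netlist as a DAG with node (compute) and edge (transfer) weights."""

    def __init__(self):
        self.cost = {}                   # node -> cost of computing it
        self.preds = defaultdict(list)   # node -> nodes whose results it reads
        self.succs = defaultdict(list)   # node -> nodes that read its result
        self.comm = {}                   # (src, dst) -> cost if src and dst sit in different units

    def add_node(self, name, cost=1):
        self.cost[name] = cost

    def add_edge(self, src, dst, comm=1):
        self.preds[dst].append(src)
        self.succs[src].append(dst)
        self.comm[(src, dst)] = comm

    def topo_order(self):
        """Return the nodes so that every node appears after all its sources."""
        indeg = {n: len(self.preds[n]) for n in self.cost}
        ready = [n for n in self.cost if indeg[n] == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for s in self.succs[n]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        if len(order) != len(self.cost):
            raise ValueError("netlist contains a cycle")
        return order

# Full adder: s = (a ^ b) ^ cin, cout = (a & b) | ((a ^ b) & cin)
dag = Dag()
for name in ("a", "b", "cin"):
    dag.add_node(name, cost=0)           # primary inputs need no computation
for name in ("x1", "s", "a1", "a2", "cout"):
    dag.add_node(name, cost=1)
for src, dst in [("a", "x1"), ("b", "x1"), ("x1", "s"), ("cin", "s"),
                 ("a", "a1"), ("b", "a1"), ("x1", "a2"), ("cin", "a2"),
                 ("a1", "cout"), ("a2", "cout")]:
    dag.add_edge(src, dst)
order = dag.topo_order()
```

The edge weights are only paid when a scheduler places the two endpoints in different computing units, which is why they are stored separately from the node costs.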

FIGS. 2-3 show the underlying hardware structure of bit-serial in-memory computing and how basic Boolean logic operations are implemented. In FIG. 2: (1) The column selector (Column Multiplexer) selects one of several input signals at a given moment and passes it to the output. (2) The computational logic and memory controller (SA & Compute Logic) handles the computing logic, including arithmetic and Boolean logic, and controls the read and write operations of the memory. It is comparable to the brain of a computer: it handles the computing tasks and ensures that data is stored and accessed correctly. (3) The output driver or control unit (Output Driver) outputs computation results or stored data to external devices or to the next-stage circuit. It typically provides level shifting and current drive so that data can be transferred correctly over long distances or under high loads. These components cooperate, in a defined timing and logical relationship, to perform the primary functions of bit-serial in-memory computing: reading, writing, and processing data.

Sub-module 1.3: computing resource abstraction model. This part abstracts the concrete computing hardware resources into computing units that the compiler can use directly. As shown in FIG. 1, the lowest-level computing unit is the Sub-array, which consists of a certain number of rows and columns of bits. Each time a logic operation is executed, the same bit operation is performed between rows, in parallel across the columns. Every 4 Sub-arrays are connected into a Mat through an H-Tree structure, and every 4 Mats are connected into a Bank through an H-Tree structure. The whole hardware consists of multiple Banks. At the software level of the compiler, the lowest-level computing resource is called a Block, which corresponds to one physical Sub-array.
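The hierarchy above (4 Sub-arrays per Mat, 4 Mats per Bank) can be sketched as simple index arithmetic. The consecutive numbering of Blocks and the `shared_level` distance measure are assumptions made here for illustration; the patent does not state how Blocks are numbered or how transfer cost depends on the H-Tree level.

```python
SUBARRAYS_PER_MAT = 4   # from the description
MATS_PER_BANK = 4       # from the description

def locate(block):
    """Map a Block index to (bank, mat, sub_array), assuming consecutive numbering."""
    sub = block % SUBARRAYS_PER_MAT
    mat = (block // SUBARRAYS_PER_MAT) % MATS_PER_BANK
    bank = block // (SUBARRAYS_PER_MAT * MATS_PER_BANK)
    return bank, mat, sub

def shared_level(a, b):
    """Lowest hierarchy level two Blocks share (hypothetical distance measure).

    0 = same Sub-array, 1 = same Mat, 2 = same Bank, 3 = different Banks.
    """
    if a == b:
        return 0
    bank_a, mat_a, _ = locate(a)
    bank_b, mat_b, _ = locate(b)
    if (bank_a, mat_a) == (bank_b, mat_b):
        return 1
    if bank_a == bank_b:
        return 2
    return 3
```

A mapping step could use a measure like `shared_level` as a stand-in for edge weights when deciding which Sub-array each Block should occupy.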

Sub-module 1.4: partitioning scheme and mapping scheme of computing resources. Once the abstraction model is fixed, the compiler determines the number of computing units (Blocks) it can control from the number of Sub-arrays in the hardware. For a MIMD system it is then necessary to determine how many computing units in the system belong to each instruction stream. To improve resource utilization and simplify partitioning, the number of instruction streams can be adjusted flexibly as needed, but it must be a non-negative integer power of 2 (1, 2, 4, and so on). Every instruction stream controls the same number of computing units, and that number is also a non-negative integer power of 2. The number of instruction streams multiplied by the number of computing units per stream therefore equals the number of computing units the compiler can command, that is, the number of Blocks. Several partitioning schemes, such as 4×4 and 2×8, can thus be derived from the existing computing resources. Every compiler-level partitioning scheme must be assigned a mapping scheme that identifies which hardware Sub-array each Block corresponds to under that partitioning scheme; the mapping step addresses the problem of minimizing the cost of data transmission. FIG. 4 illustrates the basic principle of the partitioning schemes in sub-module 1.4, showing three possible software-side partitioning schemes for an example with 4 hardware-side Sub-arrays.
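The power-of-two rule can be made concrete with a short enumeration. The function name `partition_schemes` and the (streams, blocks per stream) tuple layout are illustrative assumptions; the sketch also assumes the total number of Blocks is itself a power of 2.

```python
def partition_schemes(num_blocks):
    """List every (instruction_streams, blocks_per_stream) split of num_blocks.

    Both factors are powers of 2 and their product equals num_blocks.
    """
    if num_blocks < 1 or num_blocks & (num_blocks - 1):
        raise ValueError("this sketch assumes a power-of-2 number of Blocks")
    schemes = []
    streams = 1
    while streams <= num_blocks:
        schemes.append((streams, num_blocks // streams))
        streams *= 2
    return schemes
```

For 4 Blocks this yields three schemes, (1, 4), (2, 2), and (4, 1), matching the three software-side schemes the text mentions for 4 Sub-arrays; for 16 Blocks the list includes (4, 4) and (2, 8).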

Sub-module 1.5: scheduling scheme and overhead estimation for each computing node of the computational graph under a given partitioning scheme and its mapping scheme. Sub-module 1.4 fixes the number of instruction streams in the MIMD system and the number of hardware resources each stream controls. The task scheduling algorithm then specifies, for each node, the time at which it is computed, the instruction stream responsible for computing it, the time at which it obtains the data it depends on from earlier storage or from other computing units, and the Sub-array row to which each piece of data is allocated. For example, a heuristic algorithm based on the critical path has been implemented; it raises the utilization of computing resources as far as possible and reduces energy consumption and computing time while satisfying the constraints of in-memory computing capacity and the synchronous communication mechanism. FIG. 6 demonstrates the effect of different scheduling algorithms in sub-module 1.5 and the distinctive synchronous communication mechanism of this work.
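The patent does not give the heuristic itself. The following is a generic critical-path list scheduler, offered only as a sketch of the idea under simplifying assumptions: each instruction stream runs one node at a time, every operand fetched from another stream adds a fixed `comm` delay on the receiving stream (a crude stand-in for synchronous transfer), and row-capacity limits are ignored. All names are hypothetical.

```python
def critical_path_schedule(cost, preds, num_streams, comm=1):
    """Greedy list scheduling, highest remaining critical path first.

    cost: node -> compute cost; preds: node -> list of source nodes.
    Returns (placement, finish_times, makespan).
    """
    succs = {n: [] for n in cost}
    for n, sources in preds.items():
        for p in sources:
            succs[p].append(n)

    rank = {}
    def bottom_level(n):
        # Longest compute path from n to any output, including n itself.
        if n not in rank:
            rank[n] = cost[n] + max((bottom_level(s) for s in succs[n]), default=0)
        return rank[n]
    for n in cost:
        bottom_level(n)

    free_at = [0] * num_streams
    placed, finish = {}, {}
    remaining = {n: len(preds.get(n, ())) for n in cost}
    ready = [n for n in cost if remaining[n] == 0]
    while ready:
        ready.sort(key=lambda n: (-rank[n], n))
        n = ready.pop(0)
        best = None
        for s in range(num_streams):
            start = max([free_at[s]] + [finish[p] for p in preds.get(n, ())])
            moves = sum(comm for p in preds.get(n, ()) if placed[p] != s)
            end = start + moves + cost[n]
            if best is None or end < best[0]:
                best = (end, s)
        finish[n], placed[n] = best
        free_at[placed[n]] = finish[n]
        for s in succs[n]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return placed, finish, max(finish.values())

# Seven unit-cost gates forming a two-level reduction tree.
cost = {f"g{i}": 1 for i in range(1, 8)}
preds = {"g5": ["g1", "g2"], "g6": ["g3", "g4"], "g7": ["g5", "g6"]}
_, _, one_stream = critical_path_schedule(cost, preds, num_streams=1)
_, _, two_streams = critical_path_schedule(cost, preds, num_streams=2)
```

In this toy case a second stream shortens the schedule only slightly, because each cross-stream operand costs as much as a gate. That illustrates why communication has to be part of the scheduling model rather than an afterthought.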

Sub-module 1.6: linear programming system. For every possible partitioning and mapping scheme obtained in sub-module 1.4, sub-module 1.5 can produce a scheduling plan and estimate the cost of traversing the entire computational graph under that plan. However, the maximum parallelism each partitioning scheme supports is limited: under a given partitioning and mapping scheme there is an upper limit on the amount of data that can be computed in parallel (the parallelism in sub-module 1.4 equals the number of columns of each Sub-array multiplied by the number of Blocks controlled by the instruction stream). If the amount of data the user needs to compute exceeds this limit, the computational graph must be traversed multiple times, possibly using different partition-mapping schemes and their scheduling schemes. A linear programming system is therefore built from the cost of each partition-mapping scheme estimated in sub-module 1.5, the maximum parallelism each scheme supports, and the total amount of data the user needs to compute, so that the user completes the computing task at hand in as short a time as possible. FIG. 5 shows how sub-modules 1.4, 1.5, and 1.6 cooperate to complete task scheduling as a whole.
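One way to read this step: choose non-negative integer traversal counts x_i for each scheme i, minimizing the total cost sum(c_i * x_i) subject to sum(p_i * x_i) >= N, where c_i is the estimated cost of one traversal, p_i the maximum parallelism, and N the total data amount. The patent names a linear programming system; the sketch below swaps in a small dynamic program that solves this covering problem exactly for small inputs, and all numbers in the example are made up.

```python
def plan_traversals(schemes, total_data):
    """schemes: list of (cost_per_traversal, max_parallelism), parallelism >= 1.

    Returns (minimum_total_cost, traversal_count_per_scheme).
    """
    inf = float("inf")
    best = [0] + [inf] * total_data       # best[k]: cheapest way to cover k data items
    choice = [None] * (total_data + 1)    # scheme used last when covering k items
    for need in range(1, total_data + 1):
        for i, (cost, par) in enumerate(schemes):
            rest = max(0, need - par)
            if best[rest] + cost < best[need]:
                best[need] = best[rest] + cost
                choice[need] = i
    counts = [0] * len(schemes)
    need = total_data
    while need > 0:
        i = choice[need]
        counts[i] += 1
        need = max(0, need - schemes[i][1])
    return best[total_data], counts

# Hypothetical estimates: (cost of one traversal, data items handled per traversal).
schemes = [(10, 256), (6, 128), (4, 64)]
total_cost, counts = plan_traversals(schemes, total_data=320)
```

With these figures the cheapest plan combines one traversal of the first scheme with one of the third, rather than repeating any single scheme.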

Sub-module 1.7: instruction generation system. This module takes the complete computing scheme for the whole workload, obtained from the results of sub-modules 1.4, 1.5, and 1.6, and describes the final instructions it produces equivalently in a defined instruction system. The instructions can be used by the simulator to verify results and, in future work, translated into equivalent binary code for actual execution by hardware.

In addition, module 2, a simulator for verifying the correctness of the above compiler, comprises the following sub-modules:

Sub-module 2.1: test data generation module. This module randomly generates a specified quantity of input data of a specified bit width according to the tester's needs, for use in subsequent simulation tests.
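A generator of this kind might look like the following sketch. The function name `generate_inputs`, the per-vector dictionary layout, and the `seed` parameter (added so that runs are repeatable) are assumptions; `bit_width` is assumed to be at least 1.

```python
import random

def generate_inputs(names, count, bit_width, seed=0):
    """Return `count` random test vectors; each maps an input name to a bit_width-bit value."""
    rng = random.Random(seed)   # private generator, so results depend only on the seed
    return [{name: rng.getrandbits(bit_width) for name in names}
            for _ in range(count)]

vectors = generate_inputs(["a", "b"], count=5, bit_width=8, seed=1)
```

Fixing the seed lets the same vectors be fed to both the function-level simulator and the reference simulation.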

Sub-module 2.2: instruction parsing module. This module directly reads the logic operation instructions generated by sub-module 1.7 and stores them in memory for use during simulation.

Sub-module 2.3: instruction function-level simulation module. This module reads in test data written by hand by a tester or generated by sub-module 2.1, loads it into sub-module 2.2, and then executes the logic computing instructions held in sub-module 2.2 strictly in sequence. After all computation is complete, the simulation result is written to an external file as the tester requires, for subsequent comparison and analysis.
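Function-level simulation of row-wise Boolean instructions can be sketched as below. The instruction format `(op, dst_row, src_rows)` is hypothetical, since the patent's instruction system is not given here. Each row is modelled as a Python integer whose bits are the columns, so one instruction acts on all columns at once. Multiple instruction streams and inter-Block transfers are left out of this sketch.

```python
def simulate(program, rows, num_columns):
    """Execute (op, dst_row, src_rows) instructions strictly in order.

    rows: row index -> integer whose bit k is the value stored in column k.
    Returns the final memory contents as a dict of the same shape.
    """
    mask = (1 << num_columns) - 1
    mem = dict(rows)
    for op, dst, srcs in program:
        vals = [mem[r] for r in srcs]
        if op == "AND":
            mem[dst] = vals[0] & vals[1]
        elif op == "OR":
            mem[dst] = vals[0] | vals[1]
        elif op == "XOR":
            mem[dst] = vals[0] ^ vals[1]
        elif op == "NOT":
            mem[dst] = ~vals[0] & mask   # keep only the num_columns low bits
        else:
            raise ValueError(f"unknown op {op!r}")
    return mem

# Half adder evaluated on four columns at once: row 0 holds a, row 1 holds b.
rows = {0: 0b0011, 1: 0b0101}
program = [("XOR", 2, (0, 1)),   # row 2 = sum bit
           ("AND", 3, (0, 1))]   # row 3 = carry bit
result = simulate(program, rows, num_columns=4)
```

Each column carries an independent data item, so the four columns here evaluate the half adder on four input pairs in one pass.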

Sub-module 2.4: result verification tool. This module uses another simulation tool (such as a Verilog simulator), with the Verilog file from sub-module 1.1 as input, to simulate the same test data used in sub-module 2.3, and compares that reference result with the simulation result of sub-module 2.3 item by item. If the two files are identical, the correctness verification passes and the logic operation instructions generated by the compiler are shown to be correct; if the two files differ, the cause of the difference must be analyzed further and the compiler corrected and optimized accordingly.

Through the cooperation of these four sub-modules, module 2 completes the correctness verification of the whole compiler. With it, a tester can quickly and accurately check the correctness of the compiler and ensure that the compiler generates correct logic operation instructions in practical use, improving the reliability of the compiler. FIG. 7 shows the improvement in each metric brought by the present application.
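The item-by-item comparison can be sketched as a small helper that reports where two result listings first diverge. The function name `first_mismatch` and the one-result-per-line layout are assumptions; the patent does not describe the file format.

```python
def first_mismatch(expected_lines, actual_lines):
    """Return the index of the first differing entry, or None if the results agree."""
    for i, (exp, act) in enumerate(zip(expected_lines, actual_lines)):
        if exp.strip() != act.strip():
            return i
    if len(expected_lines) != len(actual_lines):
        # One listing is a strict prefix of the other.
        return min(len(expected_lines), len(actual_lines))
    return None
```

Returning the position of the first difference, rather than only pass or fail, gives a starting point for the further analysis the text calls for when the files differ.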

The beneficial effects of this application are as follows:

Compared with existing in-memory computing work, the application supports a MIMD instruction system and the related task scheduling, so that the utilization of computing resources and the computing speed are markedly improved when the computing workload is low. Specifically, by compiling simple logic computing tasks into Boolean operation instructions executable by bit-serial in-memory computing, the compiler enables more efficient parallel computing. In addition, a basic model of MIMD task scheduling and a module partitioning scheme for bit-serial in-memory computing systems are formulated, making task scheduling more accurate and efficient. Because task scheduling takes into account more realistic conditions such as synchronous communication and storage capacity, it can produce more accurate scheduling results. These beneficial effects give the application broad prospects in the field of in-memory computing.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims

1. A bit-serial in-memory computing MIMD Boolean logic compiler, characterized by comprising a module 1 and a module 2; the module 1 is a MIMD instruction compiler for bit-serial in-memory computing, and the module 1 processes a Verilog file line by line to generate an intermediate representation, describes the logic netlist as a directed acyclic graph, abstracts the computing resources as Blocks, partitions the computing resources according to the number of Sub-arrays with a flexibly adjustable number of instruction streams, assigns a mapping scheme to each partitioning scheme, and estimates the scheduling overhead; the module 2 is a simulator for verifying the correctness of the compiler, and the module 2 generates test data, parses instructions, performs function-level simulation, and verifies the results.

2. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein said module 1 comprises a Verilog file parsing tool that takes as input a combinational-logic Verilog file meeting specific format requirements, analyzes it line by line, and generates an intermediate representation suitable for subsequent processing.

3. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein the module 1 comprises a directed acyclic graph describing the same logic netlist as the input file.

4. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein the module 1 comprises a computing resource abstraction model that abstracts concrete computing hardware resources into computing units that can be directly used by the compiler.

5. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein the module 1 comprises a partitioning-and-mapping sub-module for computing resources that determines the number of computing units (Blocks) the compiler can control according to the number of Sub-arrays in the hardware.

6. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein the module 1 comprises a task scheduling algorithm for specifying a mapping scheme for each partitioning scheme and estimating scheduling overhead on the premise of satisfying constraints of in-memory computing capacity and synchronous communication mechanisms.

7. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein said module 1 comprises a linear programming system that makes a scheduling plan for every possible partitioning and mapping scheme and estimates the cost required to traverse the entire computational graph under that plan, so that the user completes the computing task at hand in as short a time as possible.

8. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein said module 1 comprises an instruction generation system that equivalently describes the final instructions generated by the computing scheme using a well-defined instruction system for the simulator to verify the results and translate them into equivalent binary code for actual execution by hardware in future operations.

9. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 1, wherein the module 2 comprises: a test data generation module, which randomly generates a specified quantity of input data of a specified bit width according to the tester's needs; an instruction parsing module, which directly reads the logic operation instructions generated by the module 1 and stores them in memory; and an instruction function-level simulation module, which loads the generated test data, executes the logic computing instructions strictly in sequence, and, after all computation is complete, outputs the simulation results to an external file as needed.

10. The bit-serial based in-memory computing MIMD boolean logic compiler of claim 9, wherein the module 2 further comprises: a result verification tool, which uses another simulation tool, with the Verilog file of the module 1 as input, to simulate the same test data, and compares that reference result with the simulation result item by item; if the two files are identical, the correctness verification passes and the logic operation instructions generated by the compiler are shown to be correct; if the two files differ, the cause of the difference must be analyzed further and the compiler corrected and optimized accordingly.