CN115061693A - In-memory computing code and micro-architecture optimization method and device and computing equipment - Google Patents


Info

Publication number: CN115061693A
Application number: CN202210990034.4A
Authority: CN (China)
Prior art keywords: micro-architecture, code, deep learning, simulation model
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN115061693B (en)
Inventors: 郭克, 卢彦, 孟杰
Current assignee: Uniontech Software Technology Co Ltd (the listed assignees may be inaccurate)
Original assignee: Uniontech Software Technology Co Ltd
Events: application filed by Uniontech Software Technology Co Ltd; publication of CN115061693A; application granted; publication of CN115061693B

Classifications

    • G06F8/443: Compilation; optimisation (transformation of program code)
    • G06F8/447: Compilation; target code generation
    • G06F30/3308: Design verification, e.g. functional simulation or model checking, using simulation
    • G06F30/337: Design optimisation (circuit design at the digital level)
    • G06N3/105: Neural networks; shells for specifying net layout
    • G06F2115/10: Processors (details relating to the type of the circuit)
    • G06F2117/08: HW-SW co-design, e.g. HW-SW partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an in-memory computing code and micro-architecture optimization method, apparatus and computing device, belongs to the field of in-memory computing micro-architecture design, and solves the problem that existing modeling, simulation and executable-code optimization for in-memory computing micro-architectures is complex and time-consuming. The method comprises the following steps: creating a simulation model of in-memory computing; registering an instruction for each class of the simulation model and adding support for these instructions in a compiler; creating a deep learning model described by the intermediate language of a compiler middle/back-end framework; interpreting the deep learning model into a set of micro-architecture configurations of the simulation model; generating a code set; running the code in the simulator; and determining the optimal micro-architecture configuration and optimal code according to the running results. The apparatus comprises a simulation model creation unit, a registration unit, a first compilation unit, a configuration generation unit, a second compilation unit and a running unit. The invention can be used for the collaborative optimization of in-memory computing code and micro-architectures.

Description

In-memory computing code and micro-architecture optimization method and device and computing equipment
Technical Field
The invention relates to the field of in-memory computing micro-architecture design, and in particular to an in-memory computing code and micro-architecture optimization method, apparatus and computing device.
Background
In-memory computing implements the computation unit inside the memory device, thereby avoiding the frequent, massive data movement that large-scale computation otherwise requires.
Finding the optimal configuration of an in-memory-computing micro-architecture for an actual deep learning application requires counting and classifying the operations the application generates, evaluating their hardware cost, building a model from the resulting parameters, and instantiating it in an architecture simulator. This process is very complex and often takes a team of dozens several years; for novel in-memory computing devices such as memristors and optical or quantum computing, for which little experience has accumulated in the history of micro-architecture development, it takes even longer. Improving the design efficiency of micro-architectures dedicated to in-memory computation for specific deep learning applications therefore remains a significant challenge. Likewise, compiler optimization for a specific micro-architecture is a time- and labor-intensive task: large amounts of hardware and application-logic data must be collected and extensively analyzed to find the bottlenecks before the compiler can be improved to output an optimal code sequence executable by the micro-architecture. And each time the micro-architecture hardware is upgraded, the same work must be done again.
S. Rogers, J. Slycord et al. designed special-purpose processors for specific programs based on clang (a compiler front end for the C, C++ and Objective-C languages) and LLVM; this work provides a foundation in the form of a mapping from LLVM IR to hardware data paths and operations, but has many drawbacks for designing in-memory computing systems for deep learning applications. For example, adapting a standard deep learning model into a C program that clang can compile requires a significant amount of manual coding. Furthermore, these designs support neither micro-architectural exploration of in-memory computing models nor optimization of compiler code generation.
In summary, the existing schemes for modeling and simulating in-memory computing micro-architectures and optimizing their executable code suffer from a complex and time-consuming optimization process.
Disclosure of Invention
To this end, the present invention provides an in-memory computing code and micro-architecture optimization method, apparatus and computing device in an effort to solve or at least mitigate at least one of the problems identified above.
According to one aspect of the present invention, there is provided an in-memory computing code and micro-architecture optimization method, comprising: creating a simulation model of in-memory computation; registering an instruction for each class of the simulation model, and adding support for the instruction in a compiler; creating a deep learning model described by the intermediate language of a compiler middle/back-end framework; interpreting said deep learning model as a set of micro-architecture configurations of said simulation model, each micro-architecture configuration in said set being configured as a simulator; generating, with said compiler, a code set capable of running in the simulator; and running the code in the code set in the simulator and determining the optimal micro-architecture configuration and optimal code according to the running results.
Optionally, in the in-memory computing code and micro-architecture optimization method according to the present invention, creating a simulation model of in-memory computation includes: creating several classes for implementing the computing functionality.
Optionally, in the in-memory computing code and micro-architecture optimization method according to the present invention, creating a deep learning model described by the intermediate language of a compiler middle/back-end framework includes: compiling the deep learning model using the compiler to obtain a deep learning model described by that intermediate language.
Optionally, in the in-memory computing code and micro-architecture optimization method according to the present invention, the method further includes: mapping the operator corresponding to each class of the simulation model in the interpreter to the instruction registered for that class.
Optionally, in the in-memory computing code and micro-architecture optimization method according to the present invention, interpreting the deep learning model as a set of micro-architecture configurations of the simulation model includes: interpreting the deep learning model into the set of micro-architecture configurations through the interpreter.
Optionally, in the in-memory computing code and micro-architecture optimization method according to the present invention, determining the optimal micro-architecture configuration and optimal code according to the running results includes: taking the Cartesian product of the micro-architecture configuration set and the code set; and determining the optimal micro-architecture configuration and optimal code from the Cartesian product.
According to another aspect of the present invention, there is also provided an in-memory computing code and micro-architecture optimization apparatus, comprising: a simulation model creation unit adapted to create a simulation model of in-memory computation; a registration unit adapted to register an instruction for each class of the simulation model and to add support for the instruction in a compiler; a first compilation unit adapted to create a deep learning model described by the intermediate language of a compiler middle/back-end framework; a configuration generation unit adapted to interpret the deep learning model as a set of micro-architecture configurations of the simulation model, each micro-architecture configuration in the set being configured as a simulator; a second compilation unit adapted to generate, with the compiler, a code set capable of running in the simulator; and a running unit adapted to run the code in the code set in the simulator and determine the optimal micro-architecture configuration and optimal code according to the running results.
Optionally, in the in-memory computing code and micro-architecture optimization apparatus according to the present invention, the apparatus further includes: a mapping unit adapted to map the operator corresponding to each class of the simulation model in the interpreter to the instruction registered for that class.
According to another aspect of the present invention, there is also provided a computing device comprising: at least one processor and a memory storing program instructions; the program instructions, when read and executed by a processor, cause a computing device to perform the in-memory computing code and microarchitectural optimization method as described above.
According to still another aspect of the present invention, there is also provided a readable storage medium storing program instructions that, when read and executed by a computing device, cause the computing device to perform the in-memory computing code and micro-architecture optimization method as above.
With the in-memory computing code and micro-architecture optimization method, apparatus and computing device of the invention, a compiler and an interpreter can simultaneously produce GEM5-based micro-architecture configurations and the corresponding executable code for a specific deep learning model; each group of code is then run on the simulator corresponding to each micro-architecture configuration, and the optimal micro-architecture configuration and code are selected according to the running results, realizing the collaborative optimization of micro-architecture configuration and code.
The in-memory computing code and micro-architecture optimization method, apparatus and computing device can achieve at least one of the following beneficial effects: collaborative optimization of a deep learning in-memory-computing micro-architecture and its executable code; and an optimization process that completes automatically, requires no complex manual work, and takes little time (taking a CPU structure as the optimization object, optimization can be completed in two to three days), and is therefore highly efficient.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 illustrates a flow diagram of a method 200 for in-memory computing code and micro-architecture optimization according to one embodiment of the invention;
FIG. 3 illustrates a schematic diagram of a method 200 according to one embodiment of the invention;
FIG. 4 is a block diagram of an in-memory computing code and micro-architecture optimization apparatus 400 according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Deep learning algorithms are used by many industries to solve domain-specific problems; they are widely applied to a series of significant problems such as computer vision, pattern recognition, speech recognition and natural language processing, and have produced breakthroughs across many domains. These breakthroughs come with a huge computational burden and power consumption for traditional graphics processing units (GPUs) and central processing units (CPUs). A large amount of GPU and CPU time and energy is consumed moving data between different operation units, which greatly limits the room for performance improvement and power reduction; the industry refers to this as the memory wall problem. The memory wall problem has directly driven industry interest in non-von Neumann computers, and architectures such as in-memory computing and near-memory computing have attracted continuous and wide attention.
In-memory computing overcomes these weaknesses of the traditional CPU and GPU but brings new architecture-exploration problems; in particular, using an in-memory computing structure for deep learning computation requires many trade-off decisions within the micro-architecture in order to fully exploit the structure's performance.
Aiming at the complex and time-consuming optimization process of existing in-memory computing micro-architecture and compiler optimization methods, the invention provides an in-memory computing code and micro-architecture optimization method that collaboratively optimizes the in-memory computing micro-architecture and the corresponding compiler, with a simple process and a short time cost.
FIG. 1 shows a schematic diagram of a computing device 100, according to one embodiment of the invention. It should be noted that the computing device 100 shown in fig. 1 is only an example; in practice, the computing device implementing the in-memory computing code and micro-architecture optimization method of the present invention may be any type of device, and its hardware configuration may be the same as or different from that of the computing device 100 shown in fig. 1. In practice, the computing device implementing the method may add or remove hardware components relative to the computing device 100 shown in fig. 1, and the present invention does not limit the computing device's specific hardware configuration.
As shown in fig. 1, the computing device 100 includes a memory 110 and a processor 120, and the memory 110 and the processor 120 communicate with each other via a bus.
Depending on the desired configuration, memory 110 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. In an embodiment in accordance with the invention, memory 110 includes a system architecture modeler (GEM5-SALAM), a deep learning compiler framework (TVM), a compiler middle/back-end framework (LLVM), and program instructions for performing the in-memory computing code and micro-architecture optimization method. GEM5 is an event-based software framework for generating hardware simulators; through C++, Python (a high-level programming language) and Ruby (a simple, fast object-oriented scripting language) programming, GEM5 supports the various objects and events of the memory model, CPU model and hardware system, and can thus completely and accurately simulate the runtime behavior of a hardware system. GEM5-SALAM is an extension of GEM5 that mainly adds the ability to automatically simulate specific hardware in certain domains and to interpret a standard model described in LLVM IR into a specific GEM5 configuration. TVM is a deep learning compiler framework that can compile deep learning models into source code that actually runs on a particular architecture. LLVM is a compiler middle/back-end framework that implements the optimization and generation of specific object code.
Depending on the desired configuration, processor 120 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
Computing device 100 may be implemented as a server, such as a file server, a database server, an application server, a WEB server, and the like, as well as a personal computer including desktop and notebook computer configurations. Of course, the computing device 100 may also be implemented as part of a small-sized portable (or mobile) electronic device. In an embodiment in accordance with the invention, computing device 100 is configured to perform an in-memory computing code and micro-architecture optimization method 200 in accordance with the invention.
The in-memory computing code and micro-architecture optimization method 200 according to an embodiment of the present invention comprises: creating a simulation model of in-memory computation; registering an instruction for each class of the simulation model, and adding support for the instruction in a compiler; creating a deep learning model described by the intermediate language of a compiler middle/back-end framework; interpreting the deep learning model into a set of micro-architecture configurations of the simulation model, each micro-architecture configuration in the set being configured as a simulator; generating, with the compiler, a code set capable of running in the simulator; and running the code in the code set in the simulator and determining the optimal micro-architecture configuration and optimal code according to the running results.
FIG. 2 illustrates a flow diagram of a method 200 for in-memory computing code and micro-architecture optimization according to one embodiment of the invention. Method 200 is performed in a computing device, such as computing device 100 described above. As shown in fig. 2, method 200 begins at 210.
At 210, a simulation model of the in-memory computation is created.
It should be noted that the simulation model here is a GEM5 simulation model of the in-memory computing micro-architecture, and is one of the objects to be optimized by the method 200. The GEM5 simulation model is created as follows: according to the computing functions the GEM5 simulation model is to implement, several classes are created in GEM5 for implementing those computing functions.
Next, at 220, an instruction is registered in GEM5 for each class of the GEM5 simulation model, and support for the instruction is added within the compiler. When these instructions are executed, the compute function of the corresponding class is called to implement the in-memory computation.
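The register-then-dispatch relationship described above can be sketched in a few lines. The following is a minimal, hypothetical Python model (all class, table and function names are invented for illustration and are not GEM5 APIs) in which executing a registered instruction calls the compute function of the corresponding simulation class:

```python
# Hypothetical sketch: one instruction per simulation-model class; executing
# the instruction dispatches to that class's compute function.

class SimClass:
    """Base for in-memory-compute simulation classes."""
    def compute(self, operands):
        raise NotImplementedError

class SimGemm(SimClass):
    """Stand-in for a general-matrix-multiplication class."""
    def compute(self, operands):
        a, b = operands
        # plain Python matrix multiplication as a placeholder
        return [[sum(x * y for x, y in zip(row, col))
                 for col in zip(*b)] for row in a]

INSTRUCTION_TABLE = {}

def register_instruction(name, sim_class):
    """Register an instruction so that executing it calls the class's compute."""
    INSTRUCTION_TABLE[name] = sim_class()

def execute(name, operands):
    return INSTRUCTION_TABLE[name].compute(operands)

register_instruction("gemm", SimGemm)
```

In the actual flow the registration happens inside GEM5 and the compute function models the in-memory device; the sketch only shows the dispatch shape.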
Next, at 230, a deep learning model described by the intermediate language of the compiler middle/back-end framework, i.e., a deep learning model described by LLVM IR, is created.
Optionally, at 230, a deep learning model produced by TensorFlow, PyTorch or ONNX may be compiled using the TVM compiler, resulting in a deep learning model described by LLVM IR. TensorFlow is a symbolic mathematics system based on dataflow programming, widely used to implement various machine learning algorithms. PyTorch is an open-source Python machine learning library based on Torch; it can be used for applications such as natural language processing, provides powerful GPU-accelerated tensor computation, and includes deep neural networks built on an automatic differentiation system. Open Neural Network Exchange (ONNX) is an ecosystem of deep learning development tools.
Next, in 240, the deep learning model described by LLVM IR is interpreted as a set of micro-architectural configurations of the GEM5 simulation model, each micro-architectural configuration in the set of micro-architectural configurations acting as a simulator.
It should be noted that, as shown in fig. 3, the simulator here mainly includes a micro-architecture configuration file (representing the configuration of the micro-architecture), a TVM code executor for executing the code generated by the compiler, and respective in-memory calculation models of the GEM5 simulation model.
Optionally, step 240 may be implemented using the LLVM IR interpreter in GEM5-SALAM. The operators in the LLVM IR interpreter corresponding to each class of the GEM5 simulation model must first be mapped to the instructions registered for those classes.
The three operations most commonly used in deep learning are general matrix multiplication, general matrix multiplication followed by a nonlinear operation, and a standalone nonlinear operation; the creation of the GEM5 simulation model and its peripheral setup can therefore be realized through the following five steps.
1. A simGEMM class is created that inherits from the MemObject class and the Salam class. The simGEMM class serves as a matrix multiplier implementing general matrix multiplication, and its memory members represent the weight matrix. The Salam class is implemented by GEM5-SALAM; inheriting it allows the simGEMM class to be registered with the LLVM IR interpreter of GEM5-SALAM so that the interpreter can schedule it when generating the circuit. (The LLVM IR interpreter is similar to a compiler, except that its output is a configuration.) In addition, two new members are added to the simGEMM class: totalWidth and totalHeight, which represent the width and height of the matrix multiplier's weight array and reflect the memory size; the width and height of the weight matrix currently used for computation are given by width and height respectively. The values of totalWidth and totalHeight are set at memristor_init (memory initialization) time. Note that the product of totalWidth and totalHeight should equal the totalSpace member of the MemObject class (totalSpace is the total byte length of occupied memory), and width and height must satisfy width ≤ totalWidth and height ≤ totalHeight; otherwise a read/write out-of-bounds is deemed to have occurred and exception handling is triggered. To support debugging of out-of-bounds accesses, the debug function that prints matrix contents must be overloaded. The input buffer of the simGEMM class is set to 1024 machine words and the output buffer to 1024 machine words; writing data into the input buffer triggers the simGEMM class's matrix computation, and when the computation finishes, the result is written into the output buffer and an interrupt is triggered.
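The totalWidth/totalHeight bookkeeping and the out-of-bounds rule in step 1 can be modeled in a few lines. This is an illustrative Python sketch, not GEM5 C++ code; class and member names are hypothetical, loosely mirroring the members described above:

```python
# Hypothetical model of the simGEMM capacity bookkeeping: the weight array's
# capacity is fixed at memory-initialization time, and selecting an active
# width/height beyond it triggers the out-of-bounds exception path.

class OutOfBoundsAccess(Exception):
    """Raised when a read/write would cross the weight-array boundary."""

class SimGemmModel:
    def __init__(self, total_width, total_height):
        # set once, at memory-initialization (memristor_init) time
        self.total_width = total_width
        self.total_height = total_height
        # mirrors MemObject.totalSpace: total byte length of occupied memory
        self.total_space = total_width * total_height
        self.width = 0   # width of the weight matrix currently in use
        self.height = 0  # height of the weight matrix currently in use

    def set_active_shape(self, width, height):
        """Select the sub-matrix currently used for computation."""
        if width > self.total_width or height > self.total_height:
            raise OutOfBoundsAccess((width, height))
        self.width, self.height = width, height

m = SimGemmModel(total_width=64, total_height=64)
m.set_active_shape(32, 16)  # within capacity: accepted
```

The real class would additionally manage the 1024-word input/output buffers and the interrupt on completion; the sketch isolates only the bounds rule.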
2. A simGEMMRelu class is created that inherits from the simGEMM class. simGEMMRelu overloads the output function: when computation finishes, a Relu operation is applied before writing to the output buffer. The input of simGEMMRelu is a general matrix multiplication; its function is to apply a Relu operation to the result of the general matrix multiplication and write the result into the output buffer.
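As an illustration of the overloaded output step, the following small Python sketch (function names hypothetical) applies Relu to a GEMM result before write-back:

```python
# Sketch of simGEMMRelu's write-back: elementwise Relu over the GEMM result
# before it goes to the output buffer.

def relu(x):
    return x if x > 0 else 0

def gemm_relu_output(gemm_result):
    """Write-back step: apply Relu to every element of the GEMM result."""
    return [[relu(v) for v in row] for row in gemm_result]

gemm_relu_output([[1, -2], [-3, 4]])  # → [[1, 0], [0, 4]]
```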
3. A reluSim class is created that inherits from the SimObject class and the Salam class. The input buffer of the reluSim class is set to 1024 machine words and the output buffer to 1024 machine words; an input operation triggers the Relu computation, and when the computation finishes, an output operation writes the result into the output buffer and triggers an interrupt.
4. The simGEMM class, simGEMMRelu class and reluSim class are each initialized as an instance in GEM5-SALAM, and three instructions, gemm, gemmrelu and relu, are registered in GEM5. Executing the gemm instruction calls the simGEMM class's compute function; executing the gemmrelu instruction calls the simGEMMRelu class's compute function; executing the relu instruction calls the reluSim class's compute function.
5. The LLVM IR interpreter's matrix-vector multiplication operator, its relu-after-matrix-vector-multiplication operator and its relu operator are mapped to the gemm, gemmrelu and relu instructions respectively. Thus, when the deep learning model described by LLVM IR is interpreted in GEM5-SALAM, the corresponding operators are interpreted into the GEM5 simulation model instead of using the operators from the GEM5-SALAM library.
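The operator-to-instruction mapping in step 5 amounts to a small translation table. The following hypothetical Python sketch (operator keys and instruction names invented for illustration) shows the shape of that mapping:

```python
# Sketch: map LLVM-IR-interpreter operators to the registered instructions, so
# that these operators are interpreted as the simulation model rather than as
# library operators.

OPERATOR_TO_INSTRUCTION = {
    "matvec_mul":      "gemm",      # matrix-vector multiplication
    "matvec_mul_relu": "gemmrelu",  # relu after matrix-vector multiplication
    "relu":            "relu",      # standalone relu
}

def interpret(ir_ops):
    """Translate a sequence of IR operators into registered instructions."""
    return [OPERATOR_TO_INSTRUCTION[op] for op in ir_ops]
```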
To enable the compiler back end to support the three instructions gemm, gemmrelu and relu, the compiler is modified as follows: support for the gemm, gemmrelu and relu instructions is added in the compiler's RISC-V processor back end, with gemm, gemmrelu and relu corresponding to the compute operations of the simGEMM, simGEMMRelu and reluSim classes respectively. In this way, when matrix-vector multiplication occurs in LLVM IR, the compute functions of the simGEMM, simGEMMRelu and reluSim classes can be entered and processed.
The deep learning model described by LLVM IR is then interpreted into the micro-architecture configurations of the GEM5 simulation model. The deep learning model serves as the compiler's input; the compiler generates the deep learning model described by LLVM IR, which in turn serves as the input of the GEM5-SALAM interpreter; the GEM5-SALAM interpreter outputs a configuration file that stores all generated micro-architecture configurations.
Next, at 250, a compiler is employed to generate a set of code capable of running in an emulator.
The deep learning model serves as the compiler's input; the compiler generates the deep learning model described by LLVM IR, performs LLVM compilation on it, and outputs a code set. The code set includes several groups of code, each group having a particular execution order. For example, each group includes the five code sections A, B, C, D and E, but the execution order differs between groups: the first group's execution order is A, B, C, D, E while the second group's is A, C, E, B, D.
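The structure of such a code set can be sketched with the five sections from the example. This Python illustration simply enumerates orderings; a real compiler would only emit orders that respect the data dependences between sections:

```python
# Sketch: a code set as groups sharing the same code sections A..E but
# differing in execution order, as in the example above.
from itertools import permutations

sections = ["A", "B", "C", "D", "E"]

# All orderings of the five sections (illustrative; dependences would prune this).
code_set = ["".join(p) for p in permutations(sections)]
```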
The principle of steps 210 to 250 is shown in fig. 3. The deep learning model is described in LLVM IR using the compiler. On one hand, the compiler, extended with the deep-learning-related instructions, compiles the deep learning model described by LLVM IR into executable code; on the other hand, the interpreter interprets the deep learning model described by LLVM IR into micro-architecture configurations. A configuration file and a group of executable code are fed into the simulator and run; the configuration file represents the micro-architecture's hardware devices, the code executor executes the code, and the in-memory computation models are the computation models of the classes in the micro-architecture configuration.
Next, at 260, the code in the code set is run in the simulators, and the optimal micro-architecture configuration and the optimal code are determined according to the running situation.
The running situation mainly refers to the execution time of the code. In this step, several groups of code and several simulators can be selected for running according to actual requirements, and a satisfactory micro-architecture configuration and code can be chosen from the results; alternatively, the optimal micro-architecture configuration and the optimal code can be selected automatically.
The optimal micro-architecture configuration and the optimal code can be selected automatically as follows: take the Cartesian product of the micro-architecture configuration set and the code set (that is, run each group of code once in each simulator), and determine the optimal micro-architecture configuration and the optimal code from the results.
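The Cartesian-product search can be sketched as follows. Everything here is a stand-in for illustration: `run_in_simulator`, the `latency` field and the toy cost model are assumptions, not a GEM5 API — in the real flow each pair would launch a simulator instantiated from the configuration and time the code's execution.

```python
from itertools import product

def run_in_simulator(config, code):
    # Placeholder cost model standing in for an actual simulator run:
    # pretend execution time grows with the configuration's latency
    # and the number of code sections.
    return config["latency"] * len(code)

configs = [{"name": "cfg1", "latency": 3}, {"name": "cfg2", "latency": 2}]
code_groups = [["A", "B", "C"], ["A", "C", "B", "D"]]

# Cartesian product: every group of code runs once in every simulator.
results = {
    (cfg["name"], tuple(code)): run_in_simulator(cfg, code)
    for cfg, code in product(configs, code_groups)
}

# The pair with the shortest run time is the optimal combination.
best = min(results, key=results.get)
print(best)  # ('cfg2', ('A', 'B', 'C'))
```

The key property of the Cartesian product is exhaustiveness: no (configuration, code) combination is skipped, so the minimum found is a true optimum over the candidate sets.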
Step 260 enables the co-optimization of code order and micro-architecture configuration.
A specific example of the method 200 is given below. In this example, the GEM5 simulation model performs in-memory computation with a risc-v 64 processor, and the residual network resnet18 serves as the deep learning model. The running platform is a local X86 machine, and the operating system is Ubuntu. First, the compiler backend of TVM needs to be configured to LLVM, and the LLVM export content is set to the deep learning model described in LLVM IR. The input of the compiler is the deep learning model, which can be described as a computation graph; the goal of the compiler is to compile the computation graph into code that can be executed on a hardware device. After the compiler backend is configured with LLVM, the compiler can emit the deep learning model represented in LLVM IR. The script code that generates the LLVM IR description of the deep learning model is as follows:
import tvm  # import the tvm module
from tvm import relay  # import the tvm front-end processing module relay
import numpy as np  # import the numpy numerical computation module
from tvm.contrib.download import download_testdata  # import the data downloader
# PyTorch imports
import torch  # import torch, the deep learning framework
import torchvision  # import the deep learning vision module torchvision

model_name = "resnet18"  # model name
model = getattr(torchvision.models, model_name)(pretrained=True)  # obtain the model
model = model.eval()

# We grab the TorchScripted model via tracing
input_shape = [1, 3, 224, 224]
input_data = torch.randn(input_shape)  # randomly generate one input
scripted_model = torch.jit.trace(model, input_data).eval()

from PIL import Image
img_url = "https://github.com/dmlc/mxnet.js/blob/main/data/cat.png?raw=true"  # image address
img_path = download_testdata(img_url, "cat.png", module="data")  # download the image
img = Image.open(img_path).resize((224, 224))  # open the downloaded image and resize it to 224x224

# Preprocess the image and convert it to a tensor in the torchvision format
from torchvision import transforms

my_preprocess = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)
img = my_preprocess(img)
img = np.expand_dims(img, 0)  # expand the dimensions to match the input format of the PyTorch model

input_name = "input0"  # input name
shape_list = [(input_name, img.shape)]
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)  # convert the model into Relay IR form (TVM's graph-optimization intermediate language)

target = tvm.target.Target("llvm", host="llvm")  # specify LLVM as the compilation backend
dev = tvm.cpu(0)  # specify cpu0 as the running hardware
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)  # compile the deep learning model into a binary; after compilation a tvm.ll file is produced
After the script is run, the deep learning model described in LLVM IR is exported to the file tvm.ll in the directory where the script was executed, that is, to the tvm.ll file under the current directory of the interpreter; the result of compiling the deep learning model resnet18 into LLVM IR is stored in tvm.ll. The processor core is configured as risc-v 64, the cpu operation mode is set to SE (a processor operation mode), and the number of clock cycles consumed by each of the three instructions is configured according to the actual hardware characteristics. The tvm.ll file is then input into the GEM5-SALAM interpreter, which generates a series of GEM5 simulation model configurations, including a series of GEM5 configuration topology diagrams and various combinations of caching schemes, together with a script salam.py for automated configuration search.
It is noted that a GEM5 configuration topology diagram reflects the configuration of the GEM5 simulation model; each GEM5 configuration topology diagram forms a combination with a caching scheme, and each combination is instantiated as a simulator.
The tvm.ll file is compiled with the LLVM compiler implementing the three in-memory computation instructions described above, in order to obtain binary code for the risc-v 64 processor that the simulator can run. The resulting binary file is copied to the directory where the GEM5-SALAM interpreter resides and named app. The script salam.py is then executed, at which point each simulator begins running app; after the runs finish, salam.py outputs a performance file, exported as the ppa file in the script's running directory. The ppa file stores the score of each simulator and sorts the simulators by score, where the score relates to indexes such as power consumption, area and performance.
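The score ranking stored in the ppa file can be illustrated with a small parsing sketch. The file layout shown here (one "simulator score" pair per line) and the simulator names are assumptions for the example; the text does not specify the actual format salam.py emits.

```python
# Hypothetical ppa-file contents; the real format is not specified.
ppa_text = """\
sim_cfg1 0.82
sim_cfg2 0.91
sim_cfg3 0.77
"""

# Parse each line into a (simulator name, score) pair.
rows = []
for line in ppa_text.strip().splitlines():
    name, score = line.split()
    rows.append((name, float(score)))

# Sort the simulators by score, best first, mirroring the ranking
# described for the ppa file.
ranking = sorted(rows, key=lambda r: r[1], reverse=True)
print(ranking[0][0])  # sim_cfg2
```

A single scalar score per simulator is the simplest case; a real ppa file combining power, area and performance could instead carry one column per index and rank by a weighted combination.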
After all the simulators have finished running, the generated ppa file is analyzed automatically, and the GEM5-SALAM configurations that meet the design requirements are selected according to the required power consumption, area and performance indexes. The Ubuntu system is loaded in GEM5, the configuration files meeting the design requirements are loaded and run again, and the library files required for running the TVM program are installed in order to verify the performance data.
If the verification shows no problem with the performance data, TVM and autoTVM compilation and running are carried out on all GEM5 simulation model configurations, the best-performing codes obtained under the different GEM5 simulation model configurations are compared, and each GEM5 simulation model configuration is stored together with its best-performing code as the optimal combination of software and hardware co-optimization.
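Pairing each simulation model configuration with its best-performing code can be sketched as below. The configuration names, code names and timing values are invented for the illustration; in the real flow the times would come from the autoTVM runs on each configuration.

```python
# Hypothetical measured run times: (configuration, code) -> seconds.
measured = {
    ("cfg1", "codeA"): 12.0,
    ("cfg1", "codeB"): 9.5,
    ("cfg2", "codeA"): 8.1,
    ("cfg2", "codeB"): 8.7,
}

# For every configuration, keep the code with the lowest run time;
# each (configuration, best code) pair is a co-optimization result.
best_per_config = {}
for (cfg, code), t in measured.items():
    if cfg not in best_per_config or t < best_per_config[cfg][1]:
        best_per_config[cfg] = (code, t)

print(best_per_config)  # {'cfg1': ('codeB', 9.5), 'cfg2': ('codeA', 8.1)}
```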
In this process, with the code fixed, different in-memory computing micro-architecture structures are explored to find the micro-architecture configuration with the best power consumption, area and performance; and with the micro-architecture configuration and the code set fixed, the compiler's code generation for in-memory computing is searched to find the optimal code generation strategy.
The method 200 does not require hand-writing a neural network in C code; instead, it uses the deep learning model described in LLVM IR generated by TVM as the input of the GEM5-SALAM interpreter to optimize the micro-architecture of a deep-learning-specific chip. The traditional GEM5-SALAM does not support in-memory computing, whereas the method 200 realizes a simulation model of in-memory computing and uses GEM5-SALAM to perform the search and configuration generation of the in-memory computing micro-architecture. autoTVM optimizes the TVM code generation for the micro-architecture using statistics of program execution.
The method 200 realizes an end-to-end automated software stack: it translates a model into a specific deep learning micro-architecture supporting in-memory computation, and lets the compiling software co-optimize with the in-memory computing hardware model and micro-architecture. The search for an efficient micro-architecture for specific code and the search for an efficient code generation strategy for a specific computing model and micro-architecture proceed automatically and simultaneously, making full use of machine automation to explore the software and hardware design optimization space, and providing a new idea for improving the micro-architecture optimization of various new in-memory computing deep-learning-specific circuits and the code output by the compiler.
Embodiments of the present invention also provide an in-memory computing code and microarchitecture optimization apparatus 400 capable of performing the various steps of the in-memory computing code and microarchitecture optimization method 200 as described above. The above-described in-memory computing code and micro-architecture optimization apparatus 400 is described below with reference to fig. 4.
As shown in fig. 4, the in-memory computing code and micro-architecture optimizing device 400 includes a simulation model creating unit 410, a registering unit 420, a first compiling unit 430, a configuration generating unit 440, a second compiling unit 450, and a running unit 460.
The simulation model creation unit 410 is adapted to create a simulation model for in-memory computation.
The registration unit 420 is adapted to register one instruction for each class of the simulation model and to add support for the instructions within the compiler.
The first compiling unit 430 is adapted to create a deep learning model described by a mid-back end intermediate language of the compiled mid-back end framework.
The configuration generation unit 440 is adapted to interpret the deep learning model as a set of micro-architectural configurations of a simulation model, each micro-architectural configuration in the set of micro-architectural configurations being configured as a simulator.
The second compiling unit 450 is adapted to generate, with the compiler, a code set capable of running in the simulator.
The execution unit 460 is adapted to execute the code in the code set in the simulator, determining an optimal micro-architectural configuration and an optimal code according to the execution situation.
According to one implementation, the apparatus 400 further comprises a mapping unit adapted to map operators in the interpreter corresponding to each class of the simulation model to instructions registered for the class.
According to one implementation, simulation model creation unit 410 enables creation of a simulation model by creating several classes for implementing computational functions.
According to one implementation, the first compiling unit 430 compiles the deep learning model using a compiler to obtain a deep learning model described by a mid-back intermediate language of the compiled mid-back framework.
According to one implementation, configuration generation unit 440 interprets, through an interpreter, the deep learning model as the set of micro-architecture configurations of the simulation model.
According to one implementation, run unit 460 determines the optimal microarchitectural configuration and optimal code by: making Cartesian product of the micro-architecture configuration set and the code set; and determining an optimal micro-architectural configuration and an optimal code according to the Cartesian product.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard drives, USB flash disks, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the in-memory computing code and microarchitectural optimization method of the present invention according to instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose preferred embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.

Claims (10)

1. An in-memory computing code and micro-architecture optimization method, comprising:
creating a simulation model of in-memory computing, the simulation model comprising a plurality of classes;
registering an instruction for each class of the simulation model, and adding support to the instruction in a compiler;
creating a deep learning model described by a mid-back end intermediate language of a compiled mid-back end framework;
interpreting the deep learning model as a set of micro-architectural configurations of the simulation model, each micro-architectural configuration in the set of micro-architectural configurations configured as a simulator;
generating, with the compiler, a set of code capable of running in the simulator; and
running the code in the code set in the simulator, and determining the optimal micro-architecture configuration and the optimal code according to the running situation.
2. The method of claim 1, wherein the creating a simulation model of in-memory computation comprises:
creating several classes for implementing computing functions.
3. The method of claim 1, wherein the creating a deep learning model described by a mid-back intermediate language of a compiled mid-back framework comprises:
compiling the deep learning model using the compiler to obtain a deep learning model described by a mid-back intermediate language of a compiled mid-back framework.
4. The method of claim 3, further comprising:
mapping an operator in an interpreter corresponding to each class of the simulation model as an instruction registered for the class.
5. The method of claim 4, wherein interpreting the deep learning model as a set of micro-architectural configurations of the simulation model comprises:
interpreting, by the interpreter, the deep learning model as a set of micro-architectural configurations of the simulation model.
6. The method of any of claims 1 to 5, wherein determining the optimal microarchitectural configuration and optimal code based on operating conditions comprises:
performing a Cartesian product of the micro-architecture configuration set and the code set, and determining the optimal micro-architecture configuration and the optimal code according to the Cartesian product.
7. An in-memory computing code and micro-architecture optimization device, the device comprising:
the simulation model creating unit is suitable for creating a simulation model of in-memory calculation;
the registration unit is suitable for registering an instruction for each class of the simulation model and increasing the support of the instruction in a compiler;
a first compiling unit adapted to create a deep learning model described by a mid-back end intermediate language of a mid-back end framework under compilation;
a configuration generation unit adapted to interpret the deep learning model as a set of micro-architectural configurations of the simulation model, each micro-architectural configuration in the set of micro-architectural configurations configured as a simulator;
a second compiling unit adapted to generate, with the compiler, a code set capable of running in the simulator; and
a running unit adapted to run the code in the code set in the simulator and determine the optimal micro-architecture configuration and the optimal code according to the running situation.
8. The apparatus of claim 7, further comprising:
a mapping unit adapted to map an operator in the interpreter corresponding to each class of the simulation model as an instruction registered for said class.
9. A computing device, comprising:
at least one processor and a memory storing program instructions;
the program instructions, when read and executed by the processor, cause the computing device to perform the in-memory computing code and micro-architecture optimization method of any of claims 1-6.
10. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the in-memory computing code and micro-architecture optimization method of any of claims 1-6.