WO2022078400A1 - Device and method for processing multi-dimensional data, and computer program product


Info

Publication number
WO2022078400A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
intermediate representation
representation
memory
pnm
Prior art date
Application number
PCT/CN2021/123569
Other languages
French (fr)
Chinese (zh)
Inventor
董守杨
文渊博
杨君
马晓东
苏振宇
陈峋宇
Original Assignee
Cambricon Technologies Corporation Limited (中科寒武纪科技股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corporation Limited (中科寒武纪科技股份有限公司)
Publication of WO2022078400A1 publication Critical patent/WO2022078400A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the field of computers, and more particularly, to the field of processing multidimensional data.
  • the purpose of the present disclosure is to solve the prior-art problem that tensor data needs to be split into scalar data, and to provide a method and device capable of retaining tensor primitives throughout the entire processing process.
  • a method for processing multidimensional data comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and parsing the first intermediate representation and converting it into a target program, where the target program preserves multidimensional data semantics.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to execute the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the proposed Tensor Intact Compiling (TIC) architecture can improve performance, efficiency, and portability.
  • FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure
  • Figure 2a shows a flowchart of a method according to an embodiment of the present disclosure
  • Figure 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in Figure 2a;
  • Figure 3 shows a schematic diagram of a GIR according to one embodiment of the present disclosure
  • Fig. 4a shows a device for processing multi-dimensional data according to an embodiment of the present disclosure
  • Fig. 4b shows a flowchart of the steps performed by the first processing device according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to another embodiment of the present disclosure
  • FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors
  • FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., thereby forming a multi-dimensional data processor with common features;
  • Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure
  • Figure 7c shows the structure of the multi-layer processor and its functions
  • Figure 8a shows an exemplary code diagram of the first intermediate representation TIR
  • Figure 8b shows a schematic diagram of converting the second intermediate representation to the first intermediate representation according to an embodiment of the present disclosure
  • Fig. 9a and Fig. 9b show schematic diagrams of a traditional neural network operation and of the neural network after operator fusion
  • FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure
  • FIG. 11 shows a storage manner of data when there are multiple PSMs according to one embodiment of the present disclosure
  • Figures 12a-12d show schematic diagrams of data rotation when multiple PSMs are performing data storage
  • Figures 13a and 13b depict schematic diagrams of mapping parallel tasks to parallel processing clusters in virtual processors
  • Figure 14 shows the performance of TIC's technology and other benchmark technologies on GPU-TC
  • Figure 15 shows the performance of TIC's technique and other benchmark techniques on MLU
  • Figure 16 shows the performance of TIC's technology and other benchmark technologies on TPU
  • Figure 17 compares the reduction in LoC when using the techniques of TVM and TIC to implement convolution operations on GPU-TC and MLU;
  • Figure 18 shows a combined processing device
  • Figure 19 shows an exemplary board.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure.
  • the methods of the embodiments of the present application can be applied to a processor, and a compiler, a compiling component, or a compiling program can run on the processor.
  • a compiler, compiling component, or compiling program may be used to perform at least one step in the method.
  • the method may include: in operation S110, receiving a first intermediate representation, the first intermediate representation having multi-dimensional data semantics; and, in operation S130, parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
  • the multi-dimensional data of the present disclosure may include non-scalar data such as vector data, matrix data, and tensor data, and may also be any other higher-dimensional data.
  • the technical solutions of the present disclosure can also process scalar data, which will be described later. However, it should be understood that the following description will mainly take tensor data as an example.
  • the semantics of multi-dimensional data are always maintained from receiving to parsing, rather than the multi-dimensional data being split into multiple scalar data processed in loops as in the traditional technology. Therefore, in the solution of the present disclosure, there is no need to first split multi-dimensional data into scalar data and then combine the scalar data back into multi-dimensional data, thereby reducing intermediate conversion processes and improving computing efficiency.
  • since the multi-dimensional data is always maintained, it is more intuitive for the user, and it is also convenient for the user to edit the data and the above-mentioned structures. Furthermore, eliminating the semantic gap further improves the efficiency of programming and operations.
  • the semantics of preserving multi-dimensional data can be implemented by corresponding programming languages, such as programming languages with Conv semantics and Tensor (tensor) types.
  • programming languages with Conv semantics and Tensor (tensor) types can greatly reduce the amount of programming and preserve the semantics of multi-dimensional data (such as tensor data).
  • the target program described above may include high-level languages supported by neural network hardware, such as CUDA C and BANG C described above. These target programs can also handle multidimensional data and retain multidimensional data semantics.
  • the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
  • operation information may refer to various information for operating on data, such as the structure of the neural network, operators in the neural network, data access operations, data optimization, etc., which describe all operations related to processing and computing the data.
  • the operation information of multidimensional data here still includes the semantics of multidimensional data.
  • the structure of the neural network can refer to the relationship between each operator and other operators in the neural network, the relationship between the input and output of the operators, etc., which describes the overall structure of the neural network.
  • An operator in a neural network can be any information describing an operator, such as the type of the operator, whether the operator is a single operator or a combination operator of multiple single operators, etc.
  • Data attributes can describe the type of data, such as float type, fix type, and so on. It should be understood that the above-mentioned types are merely examples and not limitations of the present disclosure.
  • the order of dimensions can be NHWC or NCHW.
  • N represents the number of images in the batch;
  • H represents the number of pixels of an image in the vertical direction;
  • W represents the number of pixels in the horizontal direction;
  • C represents the number of channels (such as the channels of a black-and-white image).
  • NHWC has better memory access locality (one output pixel can be obtained for every three input pixels), while NCHW must wait for all channel inputs to be ready to obtain the final output result, which requires a large temporary space.
  • a suitable format can be determined according to actual needs, such as the processing capability of the accelerator and the compatible dimensional order.
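  • as an illustration only (not code from the patent), the following minimal C sketch contrasts how one element (n, c, h, w) maps to a flat memory offset under the NHWC and NCHW dimension orders; the dimension sizes used are hypothetical:

```c
/* A minimal sketch (illustrative assumptions, not the patent's code):
 * flat offsets of element (n, c, h, w) under NHWC and NCHW layouts. */
#include <stdio.h>

/* offset when data is laid out as N x H x W x C */
static size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
                          size_t H, size_t W, size_t C) {
    return ((n * H + h) * W + w) * C + c;
}

/* offset when data is laid out as N x C x H x W */
static size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

int main(void) {
    size_t C = 3, H = 224, W = 224;   /* hypothetical image shape */
    /* adjacent channels of one pixel are contiguous in NHWC ... */
    printf("NHWC c=0/c=1: %zu %zu\n",
           offset_nhwc(0, 0, 5, 7, H, W, C), offset_nhwc(0, 1, 5, 7, H, W, C));
    /* ... but lie H*W elements apart in NCHW */
    printf("NCHW c=0/c=1: %zu %zu\n",
           offset_nchw(0, 0, 5, 7, C, H, W), offset_nchw(0, 1, 5, 7, C, H, W));
    return 0;
}
```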
  • Figure 2a shows a flowchart of a method according to an embodiment of the present disclosure.
  • converting the first intermediate representation into a target program S130 includes: in operation S1310, converting the first intermediate representation into an abstract language representation, wherein the abstract language representation includes multi-dimensional data semantics; and, in operation S1330, converting the abstract language representation into the target program.
  • FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in Figure 2a.
  • the architecture of the present disclosure is called Tensor Intact Compiling (TIC).
  • the architecture may include: machine learning applications A0, A1, ..., AM-1; a framework; a graph intermediate representation GIR; a tensor intermediate representation TIR (Tensor Intermediate Representation, the first intermediate representation as described above); a tensor-aware language module TAL (Tensor Aware Language, as described above); a tensor abstract machine module TAM (Tensor Abstract Machine); back-end high-level languages, such as CUDA C, BANG C, XLA-TPU, etc., wherein the above-mentioned target program may refer to these back-end high-level languages; and machine learning hardware H0, H1, ..., HM-1.
  • TIR is an intermediate representation that is designed to meet the needs of machine learning and can represent scalar, vector, matrix, and tensor operations. Therefore, in addition to regular scalar operations (such as arithmetic operations, logical operations, comparison operations, memory operations, function calls, and conditional operations, etc.), TIR can also provide descriptions of vectors, matrices, and tensors, thereby maintaining the semantics of these data.
  • the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graph-form intermediate representation.
  • the traditional intermediate representation can include a graphical intermediate representation (Graph Intermediate Representation, GIR).
  • the graph intermediate representation can be a computational graph intermediate representation obtained after parsing by the deep learning framework TensorFlow, or the computational graph intermediate representation of the neural network compilation framework TVM, such as NNVM or Relay; this is only used for illustration here and is not intended to limit the scope of this application.
  • the original IR is usually split into multiple intermediate representations of scalar computation loops, and then the target program is generated according to the intermediate representations of the scalar computation loops.
  • the IR in the form of tensor needs to be converted into the form of scalar first, and then the scalar form is converted into the target program that supports tensor semantics.
  • This way of converting back and forth between tensors and scalars is very verbose and error prone.
  • in the TIR of the present disclosure, it is not necessary to convert tensor data into scalar data, and the semantics of tensors can be preserved; in addition, large tensor operations can, for example, be converted into small tensor operations rather than into scalar operations as in traditional IRs.
  • compared with splitting tensor operations into scalar operations, TIR is more intuitive, which can improve development efficiency for users.
  • the semantics of tensors are always maintained, and there is no need to perform conversion between tensors and scalars, thereby improving the efficiency of code compilation and conversion.
  • Figure 3 shows a schematic diagram of a GIR according to one embodiment of the present disclosure.
  • the second intermediate representation can be obtained by: parsing a neural network model file, where the neural network model file includes operation nodes and topological connection relationships of the neural network; according to the operation nodes and the topological connection relationship to obtain the second intermediate representation.
  • the neural network model file described here can be a Json file, which records the structure, operators and other information of the neural network, and details of the neural network can be obtained from the Json file.
  • x is the input data, which is subjected to a convolution operation with the weight data; the intermediate data generated after the convolution operation is added with the data y, and finally the calculation result is obtained, wherein y is the bias data (bias value) in the convolution operation.
  • one tensor data is divided into a plurality of scalar data during calculation, and is represented by, for example, a for loop.
  • tensor computation is divided into scalar computation, thus losing the semantics of tensor data. This brings a heavy programming burden to the user and is computationally inefficient.
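  • purely as an illustration (not code from the patent), the following C sketch shows the difference between lowering a tensor addition into a scalar loop, as a traditional lowering would, and keeping it as a single tensor-level primitive; tensor_add() is a hypothetical stand-in for a native tensor instruction or library call:

```c
#include <stddef.h>

/* traditional lowering: the tensor operation becomes scalar loops and the
 * multidimensional semantics are lost */
void add_scalar_lowered(const float *x, const float *y, float *out,
                        size_t n, size_t c, size_t h, size_t w) {
    for (size_t i = 0; i < n * c * h * w; ++i)
        out[i] = x[i] + y[i];
}

/* hypothetical tensor primitive: in a tensor-preserving flow the backend
 * would map this directly to native tensor instructions; it is stubbed with
 * the same loop here only so that the sketch is self-contained */
static void tensor_add(const float *x, const float *y, float *out,
                       const size_t shape[4]) {
    size_t total = shape[0] * shape[1] * shape[2] * shape[3];
    for (size_t i = 0; i < total; ++i) out[i] = x[i] + y[i];
}

/* tensor-preserving form: one operation, the shape travels with it */
void add_tensor_preserving(const float *x, const float *y, float *out,
                           size_t n, size_t c, size_t h, size_t w) {
    const size_t shape[4] = { n, c, h, w };
    tensor_add(x, y, out, shape);
}
```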
  • Figure 4b shows an example of a TIR representation according to one embodiment of the present disclosure.
  • the TIR indicates that the dimension order can be NCHW, and specifies the size of N ("batch_size"), the value of C ("output_channel"), and the height ("height") and width ("width").
  • the first intermediate representation also includes data ("data") and kernel ("kernel"), their convolution operation information, and the data type (e.g. float16).
  • the semantic scheme of tensor data included in FIG. 4b is not only applicable to tensor data, but also to vector data and matrix data, where vector data is one-dimensional data, and matrix data is two-dimensional data.
  • the programming manner for preserving the semantics of multi-dimensional data is not limited to the scheme shown in FIG. 4b, and any other manners capable of preserving the semantics of multi-dimensional data are included within the scope of the present disclosure.
  • TAL is an abstract language representation that can be built upon extensions to the C language, taking into account the characteristics of lower-level hardware (eg, on-chip memory hierarchy, control logic, and computational units, etc.). These hardware features will be described later when the TAM is introduced.
  • the goal of TAL is to provide users with alternatives that can be easily modified according to their needs.
  • TAL can be converted to code for target platforms; such target platforms are characterized by having native tensor instructions, such as wmma for GPU-TC.
  • the process of converting the TAL into the code of the target platform may include: firstly converting the TAL into a target program, and then compiling the target program into machine instructions that the target platform can run.
  • the TAM is a programming model provided to users of software programming that contains a basic abstraction for hardware accelerators.
  • the TAM in the embodiments of the present application may abstract common features of various neural network accelerators, and extract various key and common features of tensor processing in various machine learning architectures. Based on this TAM, hardware-aware optimizations can be performed at higher layers, and these features can even be exposed to the user. Since the TAM can be instantiated for different specific platforms (eg GPU-TC), etc., the portability of the system can be significantly improved.
  • TIR, TAL, and the TAM supporting TAL can all process multi-dimensional data and can recognize multi-dimensional semantics. Therefore, in the above conversion process, there is no need to convert multi-dimensional data such as tensor data into scalar data; instead, the conversion can work directly in the context of multidimensional data, thus reducing or eliminating the need to convert multidimensional data to scalar data and/or convert scalar data back to multidimensional data.
  • TAM is an abstraction of common features of various neural network accelerators, so it can also be regarded as a specific neural network accelerator. Just like a common neural network accelerator can run corresponding machine instructions, this particular neural network accelerator can also run a corresponding target program. It should be understood that the target program here is a general term, which may be a user-editable high-level language.
  • TAM may correspond to specific neural network accelerator hardware, and TAL may correspond to the above-mentioned target program.
  • the framework in Figure 2b can be, for example: Caffe (a convolutional neural network framework), TensorFlow, MXNet, PyTorch, PaddlePaddle (Baidu's deep learning framework), etc.
  • TAM can support the operation of TAL, just like the hardware neural network accelerator supports a specific language, such as GPU-TC supports CUDA C, MLU supports Bang C, and so on.
  • the machine learning hardware may exemplarily include H0, H1, ..., HM-1, and the languages supported by the machine learning hardware may include CUDA C, BANG C, XLA-TPU, etc.; the supported languages depend on the specific hardware architecture.
  • TAL being an abstract language means that it can easily run on the various architectures abstracted by the TAM.
  • TIC architecture of one embodiment of the present disclosure is described above in conjunction with Figures 2a-4b.
  • a TIC architecture according to another embodiment of the present disclosure is described below.
  • converting the first intermediate representation into a target program S130 may include: receiving an abstract language representation, and compiling the abstract language representation into the first intermediate representation; wherein the abstract language representation is editable by the user.
  • Figure 5 shows a schematic diagram of a tensor-preserving architecture according to the above method.
  • the architecture shown in Figure 5 may include: machine learning applications A0, A1, ..., AM-1; a framework; a graph intermediate representation GIR; a tensor-aware language module TAL; a tensor abstract machine module TAM; back-end high-level languages such as CUDA C, BANG C, XLA-TPU, etc.; and machine learning hardware H0, H1, ..., HM-1.
  • the abstract linguistic representation may be formed by receiving a second intermediate representation, the second intermediate representation comprising a graphically expressed intermediate representation.
  • the multidimensional data from the GIR can be received to form the TAL, and then the abstract language representation TAL is transformed into the first intermediate representation TIR; this differs from the flow shown in Figure 2b, where the GIR is first transformed into TIR and then into TAL.
  • the information can first be edited in the TAL to form the TIR used for subsequent processing. For example, operators derived from the GIR can be modified in the TAL.
  • the user can directly edit new operators in the TAL without receiving existing operators from the GIR. Users can form new operators that are not in the framework according to their own needs. This makes the architecture provided by the present disclosure more flexible and adaptable.
  • the TAL in FIG. 5 can also directly receive multi-dimensional data from the framework, such as operators in the framework, without receiving operators from the GIR.
  • the TAL in Figure 5 can also be built as extensions of the C language, which likewise take into account the characteristics of lower-level hardware (e.g., the on-chip memory hierarchy, control logic, and computing units).
  • the goal of TAL is to provide users with alternatives that can be modified according to the user's needs, such as defining new operators or modifying existing operators for the desired optimization.
  • the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graph-form intermediate representation.
  • a portion of the GIR may be converted to TAL, which is then edited by the user, and another portion may be converted to the TIR of the present disclosure.
  • the difference between GIR and TIR has been described above with reference to FIG. 4a and FIG. 4b, and will not be repeated here.
  • the graph intermediate representation GIR shown in Figure 5 can be entirely converted to TIR, and operators defined by the user in TAL can also be converted to TIR. It should be clear that the connection between the GIR and the TAL in FIG. 5 only schematically expresses a possible implementation, and this connection does not necessarily exist.
  • the machine learning hardware can be various hardware, such as GPU, MLU, etc.; each hardware has its own programming language, such as CUDA C, Bang C, etc., and these high-level programming languages designed for specific hardware can serve as backends.
  • TAL can be formed based on CUDA C, Bang C and other programming languages designed for specific hardware, so as to facilitate users to edit. This helps to maximize the performance of the hardware.
  • FIGS. 6a to 6d show schematic diagrams of structures of various neural network accelerators/processors.
  • the neural network accelerator in FIG. 6a is Cambricon-ACC, which includes an I/O interface circuit, a controller, a vector SPM memory, a matrix SPM memory, a vector functional unit VFU, and a matrix functional unit MFU.
  • the I/O interface circuit is connected to the controller, the SFU, the vector SPM, and the matrix SPM, while the vector SPM is connected to the VFU and the matrix SPM is connected to the MFU.
  • the vector SPM is used to store vector data
  • the matrix SPM is used to store the matrix data
  • the VFU accesses the vector data in the vector SPM
  • the MFU accesses the matrix data in the matrix SPM.
  • Figure 6b is a schematic diagram of a multilayer structure of Figure 6a.
  • the neural network accelerator includes an I/O interface circuit, a controller, a cluster memory, and a plurality of parallel computing components; each computing component includes a plurality of processing units P0-Pn, and each processing unit includes the vector SPM memory, matrix SPM memory, vector functional unit VFU, and matrix functional unit MFU shown in Figure 6a.
  • Multiple parallel computing components are connected to the cluster memory, and the cluster memory is connected to the controller.
  • the computing components, the cluster memory, and the controller perform data access through the I/O interface circuit.
  • FIG. 6c shows the structure of a tensor processing unit TPU.
  • the TPU includes an I/O interface circuit, a controller, a unified buffer (Unified Buffer), a weight first-in-first-out memory (Weight FIFO), and an arithmetic component.
  • the I/O interface circuit is connected to the controller, the unified buffer, and the weight FIFO memory.
  • the arithmetic components include an MMU, activation components, normalization/pooling components, etc., which are connected to the unified buffer and the weight FIFO memory in order to access these memories.
  • Figure 6d shows the structure of a GPU-TC.
  • the GPU includes an I/O interface circuit, a controller, and a plurality of computing components.
  • the I/O interface circuit is connected with the controller and each arithmetic component, and the controller is connected with each arithmetic component.
  • Each computing component includes a shared memory and a plurality of tensor processing cores G0-Gn connected to the shared memory.
  • FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., to form a multi-dimensional data processor with common features.
  • the virtual processor 320 may include an I/O interface circuit 3210, a control circuit 3220 and an operation component 3230, and the operation component may include a first storage circuit 3231 and an operation circuit 3233;
  • the I/O interface circuit 3210 may be configured for input and output of the virtual processor 320;
  • the control circuit 3220 may be configured to perform access operations through the I/O interface circuit 3210;
  • the first storage circuit 3231 may be configured to read at least input data and weight data through the I/O interface circuit 3210;
  • the operation circuit 3233 may be configured to read the input data and weight data from the first storage circuit 3231 for operation.
  • control circuit 3220 is connected to the I/O interface circuit 3210 and the arithmetic component 3230 to control the I/O interface circuit 3210 and the arithmetic component 3230.
  • the input and output of the virtual processor 320 can be any input and output of weight data, input data, intermediate data, instructions, code, etc.
  • the operation circuit 3233 accesses the first storage circuit 3231 and reads the required content therefrom, and stores the calculated content into the first storage circuit 3231 .
  • Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure.
  • the first storage circuit 3231 may include: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data; the operation circuit 3233 may include a parallel functional unit PFU for performing operations on non-scalar data;
  • the I/O interface circuit 3210 may be connected to the control circuit 3220, PWM and the PNM;
  • the control circuit 3220 is connected to the PWM, the PNM, and the PFU; both the PWM and the PNM are connected to the PFU.
  • the input data and weight data for the neural network operation can be stored in PNM and PWM respectively, so as to facilitate the access of the operation circuit 3233;
  • the parallel functional unit PFU can extract the required data from the PNM and PWM , and can handle vector data, matrix data, and scalar data, as well as higher-dimensional data.
  • the virtual processor of the present disclosure further includes a scalar functional unit SFU; the SFU is connected to the control circuit and the I/O interface circuit, and is configured to operate on scalar data.
  • FIG. 7b introduces the structure and function of the single-layer processor; the structure and function of the multi-layer processor are described below with reference to FIG. 7c.
  • the virtual processor further includes a shared memory circuit PSM, and the number of the arithmetic components is multiple; the PSM is configured to read input data and weight data through the I/O interface circuit ; the plurality of operation components are connected in parallel to the PSM, and are configured to read the input data and weight data from the shared memory circuit for operation.
  • the shared memory circuit PSM is connected to multiple operation components, the input data, weight data, etc. required by the operation components are first stored in the PSM, and then these operation components obtain these data from the PSM.
  • the PSM is visible to the user, and the user can explicitly manage the PSM.
  • the single-layer TAM structure shown in Fig. 7b can be mapped to the single-layer structures shown in Fig. 6a and Fig. 6c; the two-layer TAM structure shown in Fig. 7c can be mapped to the two-layer structures shown in Fig. 6b and Fig. 6d.
  • the structures shown in Figs. 7a to 7c are only an example, and the structures of any other neural network accelerators can also be abstracted.
  • the TAM does not remain unchanged, but can be changed according to the hardware structure to be abstracted.
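  • as a rough illustration (the data structures below are assumptions, not definitions from the patent), the TAM of Figures 7a to 7c could be modelled in C as nested structures: operation components with their PWM/PNM memories, clusters that share a PSM, and the virtual processor that holds the clusters:

```c
/* A minimal sketch (assumptions, not the patent's definition) of how the
 * tensor abstract machine (TAM) could be modelled as plain C data structures. */
#include <stddef.h>

typedef struct {                 /* one operation component (Fig. 7b)        */
    float *pwm;                  /* parallel weight memory                   */
    float *pnm;                  /* parallel neuron memory                   */
    size_t pwm_size, pnm_size;   /* capacities in elements                   */
} OperationComponent;            /* the PFU reads PWM/PNM and writes PNM     */

typedef struct {                 /* two-layer structure (Fig. 7c)            */
    float *psm;                  /* shared memory, visible to the user       */
    size_t psm_size;
    OperationComponent *components;
    size_t num_components;       /* components connected in parallel to PSM  */
} ProcessingCluster;

typedef struct {                 /* the virtual processor 320                */
    ProcessingCluster *clusters;
    size_t num_clusters;
    /* the control circuit and I/O interface circuit are implicit here: the
     * host issues space-allocation, memory-access, and operation instructions */
} TensorAbstractMachine;
```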
  • the control circuit reads the program instructions from the external storage DRAM;
  • the PFU reads input data and weight data from the internal memories PNM and PWM, completes a vector operation, and writes the intermediate result back to the PNM;
  • the process of converting Graph IR to TIR may include: splitting the multidimensional data of the second intermediate representation into at least one sub-multidimensional data; determining the storage space of each sub-multidimensional data according to its category (the categories include input data and weights) and the operations that each sub-multidimensional data needs to participate in; and generating space allocation instructions, memory access instructions, and operation instructions related to the sub-multidimensional data.
  • the space allocation instruction, memory access instruction and operation instruction related to the sub-multidimensional data are the instructions defined by the first intermediate representation.
  • Figure 8a shows an exemplary code diagram of the first intermediate representation TIR.
  • Tensor(fp32)<NCHW>(1,3,224,224) x — the data type of x is float32, the format is NCHW, and the size is (1,3,224,224);
  • Tensor(fp32)<NCHW>(64,3,3,3) y — the data type of y is float32, the format is NCHW, and the size is (64,3,3,3);
  • Tensor(fp32)<NCHW>(1,64,224,224) Temp — the data type of Temp is float32, the format is NCHW, and the size is (1,64,224,224).
  • TIR also provides space allocation instructions related to the storage space allocation of the input data x, weight data weight, temporary data Temp, bias data y, result data Result, etc. (for example, the allocate instruction shown in Figure 8a);
  • TIR also provides memory access instructions to load data from off-chip memory (such as GDRAM) to on-chip memory, such as the instruction load x.gdram to x.pnm; TIR also provides operation instructions describing the operations to be performed.
  • Figure 8b shows a schematic diagram of converting the second intermediate representation to the first intermediate representation according to an embodiment of the present disclosure.
  • converting the second intermediate representation into the first intermediate representation may include: in operation S810, splitting the multidimensional data of the second intermediate representation into a plurality of sub-multidimensional data according to the size of the storage basic block BBM (Building Block of Memory), and generating space allocation instructions and memory access instructions, so that corresponding storage space is applied for in the shared memory circuit PSM according to the space allocation instructions and the plurality of sub-multidimensional data are loaded from external memory into the shared memory circuit PSM according to the memory access instructions; and in operation S820, splitting the sub-multidimensional data in the PSM according to the computation basic block BBC (Building Block of Computation) and generating the corresponding space allocation instructions (such as the allocate shown in Figure 8a).
  • Figure 8b shows the flow chart of GIR conversion to TIR.
  • two basic blocks can be defined, namely, the storage basic block and the calculation basic block; wherein, the storage basic block BBM can be the smallest granularity of data copy,
  • the size of the storage basic block is determined by the shared memory circuit PSM;
  • the computing basic block can be the smallest granularity of the operation, and the size of the computing basic block is also limited by the constraints of PNM, PWM and the computing instruction of the current operation.
  • the compiler can split the data transfer from off-chip memory (such as DRAM) to the PSM according to the original data and the size of the storage basic block; that is, for an original input tensor A residing in DRAM, only one BBM-sized block is loaded into the PSM at a time, so the number of loads is tensor A/BBM.
  • the compiler can split the PSM→PWM and PSM→PNM transfers according to the storage basic block BBM and the computation basic block BBC.
  • the operation can be completed through the operation instructions defined by the TIR.
  • the result after the PFU calculation is completed is stored in the PSM according to the current split position, and then the PSM is stored in the DRAM according to the current split position.
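  • a minimal sketch of the split scheme just described, simplified to one dimension and using assumed helper names (copy_block and pfu_compute are hypothetical stand-ins for TIR load/store and operation instructions), might look as follows; it assumes total is a multiple of BBM and BBM a multiple of BBC:

```c
#include <stddef.h>
#include <string.h>

/* hypothetical data movement standing in for TIR load/store instructions */
static void copy_block(float *dst, const float *src, size_t len) {
    memcpy(dst, src, len * sizeof(float));
}

/* hypothetical PFU operation on one computation basic block (BBC) */
static void pfu_compute(const float *pnm, const float *pwm,
                        float *pnm_out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        pnm_out[i] = pnm[i] * pwm[i];
}

void tiled_like_tir(const float *dram_in, float *dram_out, size_t total,
                    float *psm, float *pnm, const float *pwm, float *pnm_out,
                    size_t BBM, size_t BBC) {
    for (size_t m = 0; m < total; m += BBM) {        /* total/BBM loads       */
        copy_block(psm, dram_in + m, BBM);           /* DRAM -> PSM (one BBM) */
        for (size_t c = 0; c < BBM; c += BBC) {      /* split inside the PSM  */
            copy_block(pnm, psm + c, BBC);           /* PSM -> PNM (one BBC)  */
            pfu_compute(pnm, pwm, pnm_out, BBC);     /* operation on the PFU  */
            copy_block(psm + c, pnm_out, BBC);       /* result back to PSM    */
        }
        copy_block(dram_out + m, psm, BBM);          /* PSM -> DRAM           */
    }
}
```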
  • Figures 8a and 8b are only used to illustrate the conversion process of the intermediate representation, and the instructions contained in the above-mentioned intermediate representation are only a form of intermediate representation, not hardware instructions that can be executed by the processor.
  • the operation process shown in the flowchart of FIG. 8b is only used to illustrate the role of the intermediate representation, and is not used to limit the specific operation process.
  • a processor needs to convert the intermediate representation shown in Figure 8a into specific machine instructions, and the processor can implement the above-mentioned operation process according to the specific machine instructions.
  • the compiler may further optimize the first intermediate representation to generate a target program according to the optimized first intermediate representation.
  • the optimization of the first intermediate representation is described in detail below.
  • the optimization scheme according to the embodiment of the present disclosure is given as follows.
  • optimizing the first intermediate representation by the compiler may include: converting the first dimension order of the multidimensional data to the second dimension order to adapt to the corresponding neural network accelerator.
  • the above optimizations are mainly for GPU operations. Since the tensor data in TensorFlow is in NHWC format by default, while it is more efficient to use NCHW on the GPU, two transformation nodes can be used during optimization, namely an NHWC-to-NCHW transformation node and an NCHW-to-NHWC transformation node; consecutive NHWC-to-NCHW and NCHW-to-NHWC transformation nodes occurring between two consecutive GPU compute nodes can cancel each other out.
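  • a possible form of this cancellation pass, written here only as an assumed sketch over a flat sequence of operation tags (not the patent's implementation), is:

```c
/* A minimal sketch: an NCHW->NHWC node immediately followed by an
 * NHWC->NCHW node (or vice versa) is removed from the op sequence. */
#include <stddef.h>

typedef enum { OP_COMPUTE, OP_NHWC_TO_NCHW, OP_NCHW_TO_NHWC } OpKind;

static int cancels(OpKind a, OpKind b) {
    return (a == OP_NHWC_TO_NCHW && b == OP_NCHW_TO_NHWC) ||
           (a == OP_NCHW_TO_NHWC && b == OP_NHWC_TO_NCHW);
}

/* removes cancelling pairs in place and returns the new length */
size_t cancel_transposes(OpKind *ops, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (out > 0 && cancels(ops[out - 1], ops[i]))
            --out;                 /* drop both nodes of the inverse pair */
        else
            ops[out++] = ops[i];
    }
    return out;
}
```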
  • optimizing the first intermediate representation by the compiler may further include: performing operator fusion on the first operator and the second operator.
  • operator fusion can also be called layer fusion, which can fuse multiple layers in the neural network to compile and generate instructions, reduce the number of accesses to off-chip memory, and thus improve data throughput.
  • the calculation process of the first operator and the second operator is as follows:
  • the intermediate results between the two operators are stored in the on-chip memory, so that the second operator can read data from the on-chip memory.
  • Figures 9a and 9b are schematic diagrams showing traditional neural network operations and a neural network after operator fusion.
  • the PFU operation reads the input data and weight data from PNM and PWM to complete the operation, and writes the result of the first operator back to the PNM;
  • PFU reads data from PNM and PWM to complete the operation, and writes the result of the second operator back to PNM;
  • in the traditional approach, the intermediate result between two operators needs to be stored in off-chip memory, and when the next operator performs an operation, the intermediate result needs to be read back from the off-chip memory; this causes every operation to read data from the off-chip memory, which obviously reduces the operation speed of the entire neural network, and the data access to the off-chip memory is also likely to become a bottleneck for the operation speed of the neural network.
  • the intermediate result can be stored in the on-chip memory (eg, SRAM), thereby improving the data access speed.
  • the first processor performing operator fusion on the first operator and the second operator includes: splitting the first weight data to form a plurality of first sub-weight data, so that the first sub-operation result of the first input data and each first sub-weight data is smaller than the capacity of the PNM; and operating on the first input data and the plurality of first sub-weight data in turn, and each time a first sub-operation result is obtained, storing the first sub-operation result in the PNM, so that the second operator can operate on the first sub-operation result to obtain a second sub-operation result.
  • FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure.
  • the far left shows the input data
  • the intermediate result in Figure 10 is assumed to be larger than the capacity of the PNM, so it will not be possible to store the intermediate result in the on-chip memory.
  • the weight data may be split, for example, into weight data 1 and weight data 2 .
  • Weight data 1 is represented by light squares
  • weight data 2 is represented by dark squares.
  • the input data is first subjected to a convolution operation (first operator) with the weight data 1, as shown by the route 1 in FIG. 10 .
  • the intermediate data generated by the convolution operation (shown as the light-colored square in the intermediate result) is stored in on-chip memory.
  • the second operator reads the stored intermediate data from the on-chip memory and performs operations (as shown in route 2), and the obtained output result is stored in the off-chip memory.
  • a convolution operation is performed on the input data and the weight data 2, as shown in the route 3 of FIG. 10 .
  • the intermediate data generated by the convolution operation (shown as dark squares in the intermediate result) is stored in on-chip memory.
  • the second operator reads the stored intermediate data from the on-chip memory and performs operations (as shown in route 4), and the obtained output result is stored in the off-chip memory.
  • the intermediate result is not generated at one time, but is formed by dividing the weight data into blocks.
  • Each generated intermediate result can be stored in the on-chip memory, and the second operator does not need to read data from the off-chip memory, thus reducing the number of accesses to the off-chip memory and improving the operation efficiency.
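  • a minimal sketch of this fusion scheme, reduced to one dimension with assumed stand-ins for the two operators (op1_chunk for the first operator and a ReLU-like op2 for the second), might look like this; pnm plays the role of the on-chip buffer:

```c
#include <stddef.h>

/* toy stand-in for the first operator (e.g. a convolution) on one weight chunk */
static void op1_chunk(const float *input, const float *w_chunk,
                      float *out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        out[i] = input[i] * w_chunk[i];
}

/* toy stand-in for the second operator (e.g. a ReLU-like activation) */
static void op2(const float *in, float *out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

/* weights are split into chunks no larger than the PNM, so every partial
 * result of the first operator stays on chip and feeds the second operator
 * directly; only the final output chunks go back to off-chip memory */
void fused_operators(const float *input, const float *weights, size_t w_len,
                     float *pnm, size_t pnm_capacity, float *output) {
    for (size_t o = 0; o < w_len; o += pnm_capacity) {
        size_t len = (o + pnm_capacity <= w_len) ? pnm_capacity : w_len - o;
        op1_chunk(input + o, weights + o, pnm, len);  /* operator 1 -> PNM (on chip) */
        op2(pnm, output + o, len);                    /* operator 2 reads the PNM    */
    }
}
```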
  • Fig. 11 shows a data storage manner when there are multiple PSMs according to an embodiment of the present disclosure
  • Figs. 12a-12d show schematic diagrams of data rotation when multiple PSMs store data.
  • when there are multiple PSMs, the first processor is further configured to: in operation S1110, store multiple sets of weight data in the multiple PSMs in rotation; in operation S1120, operate on the weight data in the plurality of PSMs at every rotation; and in operation S1130, after the rotation of the weight data through all the PSMs is completed, read new weight data into the plurality of PSMs.
  • the memory of the PFU (eg PNM and PWM) is relatively close to the computational unit, so this memory needs to be used carefully to avoid execution pipeline stalls. More specifically, the programmer needs to calculate the size of the on-chip buffer required for the computation. If the size required for one calculation exceeds the size of the PFU memory, the compilation process will stop with an error message.
  • a method to improve the utilization of the PWM is given below: by keeping the weight data resident in the PWM, multiple convolution operations can be performed on different parts of the input data. Compared with the traditional method, in which the off-chip memory needs to be accessed continuously, the efficiency of the method of the present disclosure is improved by a factor of 1.6.
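  • as a simplified, assumed illustration of this reuse (again one-dimensional and not the patent's code), the weights are copied into the PWM once and then reused for every input tile:

```c
#include <stddef.h>
#include <string.h>

/* 1-D "valid" convolution of one input tile with the weights resident in PWM */
static void conv_tile(const float *tile, size_t tile_len,
                      const float *pwm, size_t w_len, float *out) {
    for (size_t i = 0; i + w_len <= tile_len; ++i) {
        out[i] = 0.0f;
        for (size_t k = 0; k < w_len; ++k)
            out[i] += tile[i + k] * pwm[k];
    }
}

/* load the weights into the PWM once, then reuse them for every input tile,
 * instead of re-reading the weights from off-chip memory for each tile */
void conv_with_resident_weights(const float *input, size_t num_tiles, size_t tile_len,
                                const float *weights_dram, size_t w_len,
                                float *pwm, float *output) {
    size_t out_per_tile = tile_len - w_len + 1;
    memcpy(pwm, weights_dram, w_len * sizeof(float));
    for (size_t t = 0; t < num_tiles; ++t)
        conv_tile(input + t * tile_len, tile_len, pwm, w_len,
                  output + t * out_per_tile);
}
```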
  • each PSM memory can read corresponding data, such as weight data, from off-chip memory (eg, DRAM).
  • four PSMs are shown, namely PSM0, PSM1, PSM2, and PSM3; these PSMs can store four sets of weight data, respectively denoted weight data A, weight data B, weight data C, and weight data D.
  • weight data A is stored in PSM0
  • weight data B is stored in PSM1
  • weight data C is stored in PSM2
  • weight data D is stored in PSM3.
  • the four groups of weight data can be stored in the PSMs in rotation.
  • the weight data A is transferred from PSM0 to PSM1
  • the weight data B is transferred from PSM1 to PSM2
  • the weight data C is transferred from PSM2 to PSM3
  • the weight data D is transferred from PSM3 to PSM0.
  • after the above-mentioned weight data D, weight data A, weight data B, and weight data C are calculated, the next rotation is performed.
  • weight data A is transferred from PSM1 to PSM2
  • weight data B is transferred from PSM2 to PSM3
  • weight data C is transferred from PSM3 to PSM0
  • weight data D is transferred from PSM0 to PSM1.
  • weight data A is transferred from PSM2 to PSM3
  • weight data B is transferred from PSM3 to PSM0
  • weight data C is transferred from PSM0 to PSM1
  • weight data D is transferred from PSM1 to PSM2.
  • the user can implement the approach shown in Figures 11 and 12a-12d by editing the TAL, which partially compensates for the gap between the access latency of the PFU memory (e.g. 10 clock cycles) and that of the off-chip memory DRAM (e.g. 300 clock cycles).
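  • the rotation schedule of Figures 11 and 12a-12d can be illustrated with the following self-contained C sketch (an assumption about one possible schedule, printing which weight group each PSM computes with in each round):

```c
#include <stdio.h>

#define NUM_PSM 4

int main(void) {
    const char groups[NUM_PSM] = { 'A', 'B', 'C', 'D' };  /* weight data A..D       */
    int in_psm[NUM_PSM] = { 0, 1, 2, 3 };                 /* PSM i holds group in_psm[i] */

    for (int round = 0; round < NUM_PSM; ++round) {
        /* at every rotation, the weight data currently in each PSM is operated on */
        for (int p = 0; p < NUM_PSM; ++p)
            printf("round %d: PSM%d computes with weight data %c\n",
                   round, p, groups[in_psm[p]]);
        /* rotate: the group in PSMi moves to PSM(i+1), the last wraps to PSM0 */
        int last = in_psm[NUM_PSM - 1];
        for (int p = NUM_PSM - 1; p > 0; --p)
            in_psm[p] = in_psm[p - 1];
        in_psm[0] = last;
    }
    /* after the rotation through all PSMs completes, new weight data
       would be read from off-chip memory into the PSMs */
    return 0;
}
```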
  • processing core synchronization, processing cluster synchronization, and/or chip synchronization can be performed at the TAL; and/or parallel tasks are mapped to parallel processing clusters in virtual processors.
  • control logic of this embodiment can also be realized by editing the TAL by the user.
  • the main purpose of exposing control logic at the TAL level is to provide functional correctness and execution efficiency, the main feature of which is the synchronization and parallelization of a large number of computational units.
  • Processing core synchronization is to ensure the correctness of parallel execution of pipelines of different functional units such as scalars, vectors, and matrices.
  • the processing cluster includes multiple processing cores, and the processing cluster synchronization is to maintain the synchronization of all processors in a specific cluster in the same space.
  • Chip synchronization ensures that all clusters continue to execute only after all clusters have reached the synchronization point. It should be noted that processing core synchronization can be hidden from the user to simplify the programming burden of the programmer.
  • One potential optimization method is software pipelining, which can be used to hide the latency of memory accesses.
  • Such a task can be mapped to 2 processing clusters in TAM i.e. processing cluster 0 and processing cluster 1, each processing cluster has 4 processor cores i.e. processing core 0, processing core 1, processing core 2 and processing core 3.
  • TAM is used to abstract common features of multiple neural network accelerators, which extract various key and common features of tensor processing in various ML architectures. Therefore, when a task is mapped into a TAM, the TAM can be further mapped to specific hardware accelerators.
  • GPU-TC and Cambricon-ACC are taken as examples for illustration.
  • for GPU-TC, two SMs can be used; while each SM may include, for example, 8 tensor processing cores, in this task each streaming multiprocessor (SM) only needs to use 4 tensor processing cores.
  • each processing cluster has 4 processing cores, so in this task, each processing cluster can use 4 processing cores.
  • when the task requires more processing clusters or processing cores than are available, it can be run in a time-sharing manner, that is, the task can be divided into multiple executions.
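  • as a hypothetical illustration of this mapping and time-sharing (not the patent's scheduler), tasks can be assigned round-robin to 2 clusters × 4 cores, with any overflow pushed into later time slices:

```c
#include <stdio.h>

#define CLUSTERS 2
#define CORES_PER_CLUSTER 4

int main(void) {
    int num_tasks = 11;                               /* hypothetical task count */
    int slots = CLUSTERS * CORES_PER_CLUSTER;         /* 8 slots per time slice  */

    for (int t = 0; t < num_tasks; ++t) {
        int slice   = t / slots;                      /* which time-shared run   */
        int slot    = t % slots;
        int cluster = slot / CORES_PER_CLUSTER;
        int core    = slot % CORES_PER_CLUSTER;
        printf("task %2d -> time slice %d, processing cluster %d, core %d\n",
               t, slice, cluster, core);
    }
    return 0;
}
```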
  • control logic can also be implemented in TAL, so that users can customize it according to actual needs.
  • the generated code can be converted into an object program adapted to specific hardware.
  • the target program when the underlying hardware is MLU, the target program can be Bang C language to adapt to the MLU hardware; when the underlying hardware is GPU-TC, the target program can be CUDA C language. It can be understood that when the underlying hardware is other accelerators, the target program can be converted into machine instructions suitable for the accelerator.
  • the TAM can be instantiated for different specific platforms (such as MLU, TPU, GPU-TC) and is applicable to various accelerators, which significantly improves the portability of the system and facilitates porting to various accelerator hardware platforms.
  • to evaluate TIC, three types of machine learning computers are used, namely a GPU with tensor processing cores (GPU-TC), an MLU, and a TPU.
  • the neural network algorithms used for evaluation come from different application scenarios, including ResNet-50 and VGG16.
  • ResNet and VGG are not only used for image classification, but also as backbone networks for general feature extraction.
  • the first benchmark is the TVM stack, which directly supports GPU-TC, MLU, and TPU by rewriting tensor primitives.
  • the second benchmark is Glow IR, which includes two layers of IR: a high-level graph IR (mainly for graph optimizations) and a low-level instruction IR (mainly for memory-related optimizations).
  • the third benchmark is the TensorFlow framework. The above benchmarks were originally designed primarily for CPUs and GPUs (without tensor processing cores), but were modified according to the technical solutions of the present application to support GPU-TC, MLU, and TPU.
  • the experimental and comparative results mainly include three aspects: performance, efficiency and portability. It will be introduced in detail below.
  • FIG. 14 shows the performance of TIC’s technique and other benchmark techniques on GPU-TC, where execution latency is normalized to TensorFlow’s latency.
  • the average performance improvement of TIC is about 201%.
  • the main reason is that unnecessary architectural overhead is avoided and multiple optimizations are made at TAL.
  • the average performance gain over Glow is around 34.8% because Glow's backend is implemented directly through CUDA C rather than through a highly optimized library.
  • the average performance gain is about 13.7%.
  • the beneficial effects mainly come from two optimization processes, namely, the sequential optimization of data dimensions with the help of TIR and the optimization of operator fusion.
  • Figure 15 shows the performance of TIC's technique and other benchmark techniques on MLU.
  • the average performance of TIC is about 96.9% of TensorFlow's performance.
  • the main reason is that TensorFlow for MLU runs on a highly optimized library, and more optimization measures can be applied to the technical solution of TIC.
  • the performance improvement of the disclosed technical solution is about 41.4% compared to the performance of TensorFlow, because several customized optimizations are performed on the technology of TIC.
  • the performance gains are about 23.5% and 20.7%, respectively, which well demonstrates the efficiency of TIC's technology as a compilation architecture.
  • Figure 16 shows the performance of TIC's technique and other benchmark techniques on TPU; the original Glow cannot run on TPU-Lite. Due to the relatively coarse granularity of the considered TPU primitives, the optimization that can be performed on the technical solution of TIC is very limited. Therefore, the performance is relatively close for different implementations.
  • the efficiency can be evaluated from different perspectives. From the perspective of using programming architecture to build ML applications, the efficiency of TIC's technical solution is the same as other benchmarks due to the preservation of the programming interface. From the point of view of using TIR and TAL to build new operations, there is a significant increase in efficiency because the semantics of tensor data is preserved from the graph node to the TAM's hardware all the time.
  • Figure 17 compares the reduction in LoC when using the techniques of TVM and TIC to implement convolution operations on GPU-TC and MLU. It can be clearly seen that LoC drops by 43% and 38% on GPU-TC and MLU, respectively. From the perspective of using TAL to directly architect ML applications, developing applications using standard C/C++ can exhibit high efficiency. An obvious advantage is that many ready-to-use applications written in C/C++ can be directly converted to TAL without tensor-related optimizations.
  • Table 1 shows a comparison of the performance consistency achieved using TensorFlow, TVM and the TIC of the present disclosure. Quantitatively, the performance of the TIC of the present disclosure is improved by 25% and 15.4%, respectively, compared with TensorFlow and TVM.
  • TIC Tensor Preserving Compilation
  • the idea of building TIC is to preserve tensor semantics throughout the compilation process, i.e. from the upper-level programming interface to the lower-level intermediate representation and various languages, and even to the tensor-related instructions of the lower-level hardware platform.
  • the whole TIC architecture can preferably include three components, namely the tensor abstract machine module TAM, the tensor-aware language module TAL, and the tensor intermediate representation module TIR, which are mainly used to solve the problems of portability, performance, and efficiency, respectively.
  • TIC outperforms the state-of-the-art in performance, portability, and efficiency on GPU-TC, TPU and MLU.
  • Embodiments of the present disclosure also provide an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to perform the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the above-mentioned method and apparatus can also be implemented as a compiling apparatus, and the compiling apparatus can constitute a combined processing apparatus.
  • FIG. 18 shows a combined processing device 1800 , which includes the above-mentioned compiling device 1802 , a general interconnection interface 1804 , and other processing devices 1806 .
  • the compiling apparatus according to the present disclosure interacts with other processing apparatuses to jointly complete the operation specified by the user.
  • Figure 18 is a schematic diagram of a combined treatment device.
  • the compiling apparatus can be implemented in various ways such as software and hardware, and it can run on any one or more of general-purpose/special-purpose processors such as a CPU, a graphics processing unit (GPU), and a neural network processor.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses neural networks to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, to complete basic controls such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the compiling device obtains the required input data from other processing devices and writes it into the storage device on the compiling device chip; it can obtain control instructions from other processing devices and write them into the control cache on the compiling device chip; it can also read the data in the on-chip storage module of the compiling device and transmit it to other processing devices.
  • the structure may further include a storage device 1808, and the storage device is respectively connected to the compiling device and the other processing device.
  • the storage device is used to save the data in the compiling device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the compiling device or other processing devices.
  • the combined processing device can be used as an SoC (system on a chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
  • the general interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the present disclosure also discloses a chip, which includes the above-mentioned compiling apparatus or combined processing apparatus.
  • the present disclosure also discloses a board card, which includes the above-mentioned chip.
  • the above board card may also include other supporting components, including but not limited to: a storage device 1904, an interface device 1906, and a control device 1908.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the storage device may include groups of storage units 1910. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency: data is transferred on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM.
  • the storage device may include four groups of storage units. Each group of storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes a plurality of double-data-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 1912 (e.g., a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces, and the present disclosure does not limit the specific forms of the above-mentioned other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, optical, acoustic, magnetic, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • Clause 1 A method of processing multidimensional data, comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics;
  • parsing the first intermediate representation, and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
  • Clause 2 The method of Clause 1, wherein the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
  • An abstract language representation is received and compiled into the first intermediate representation; wherein the abstract language representation is editable by a user.
  • Clause 4 The method of clause 3, wherein a second intermediate representation is received to form the abstract linguistic representation, the second intermediate representation comprising a graphically expressed intermediate representation.
  • the abstract language representation is converted into the target program.
  • Clause 6 The method of any of clauses 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation to the first intermediate representation;
  • the second intermediate representation includes a graphically expressed intermediate representation.
  • Parsing a neural network model file where the neural network model file includes operation nodes and topological connection relationships of the neural network;
  • the second intermediate representation is obtained according to the operation node and the topological connection relationship.
  • Clause 8 The method of any of clauses 1-7, further comprising: optimizing the first intermediate representation to generate a target program from the optimized first intermediate representation.
  • the I/O interface circuit is configured for input and output of the virtual processor
  • the control circuit is configured to perform an access operation through the I/O interface circuit
  • the first storage circuit is configured to read at least input data and weight data through the I/O interface circuit
  • the operation circuit is configured to read the input data and weight data from the first storage circuit for operation.
  • the first storage circuit includes: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data;
  • the operation circuit includes a parallel functional unit PFU for performing operations on non-scalar data
  • the I/O interface circuit is connected to the control circuit, the PWM and the PNM; the control circuit is connected to the PWM, the PNM and the PFU; the PWM and the PNM are both connected to the PFU.
  • Clause 12 The method of clause 10 or 11, wherein the virtual processor further comprises a scalar functional unit SFU, the SFU connected to the control circuit and the I/O interface circuit and configured to perform operations on scalar data.
  • Clause 13 The method of any one of clauses 10-12, wherein the virtual processor further comprises a shared memory circuit PSM, and the number of the arithmetic components is plural;
  • the PSM is configured to read input data and weight data through the I/O interface circuit
  • the plurality of operation components are connected in parallel to the PSM and are configured to read the input data and weight data from the shared memory circuit for operation.
  • the multi-dimensional data of the second intermediate representation is divided into a plurality of pieces of sub-multi-dimensional data according to the size of the storage basic block BBM, and the plurality of pieces of sub-multi-dimensional data are transferred from the off-chip memory into the shared storage circuit PSM in multiple passes;
  • the intermediate results are stored in the off-chip memory.
  • Clause 15 An electronic device, comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method of any of clauses 1-14.
  • Clause 16 A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any of clauses 1-14.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A device and method for processing multi-dimensional data, and an electronic device and a compilation apparatus (1802). The compilation apparatus (1802) may be comprised in a combined processing apparatus (1800). The combined processing apparatus (1800) may also comprise a universal interconnection interface (1804) and other processing apparatuses (1806). The compilation apparatus (1802) interacts with the other processing apparatuses (1806), so as to jointly complete a user-specified computing operation. The combined processing apparatus (1800) may further comprise a storage apparatus (1808). The storage apparatus (1808) is respectively connected to the compilation apparatus (1802) and the other processing apparatuses (1806), and is used for storing data of the compilation apparatus (1802) and the other processing apparatuses (1806).

Description

A device, method and computer program product for processing multidimensional data
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 2020111120905, filed on October 16, 2020 and entitled "A device, method and computer program product for processing multidimensional data".
Technical Field
The present disclosure relates to the field of computers, and more particularly, to the field of processing multidimensional data.
Background
Computers equipped with accelerators for increased efficiency have received increasing attention. To take advantage of such computers, there is a huge need for advanced programming architectures to achieve high performance, improve software productivity, and ensure better portability across highly diverse ML architectures. Although existing programming tools (such as TensorFlow and TVM) have realized the importance of tensor data for reducing the burden of programming, these programming tools still do not solve the above problems well. For example, in the prior art, loop statements are generally used to split tensor data into scalar data. In this case, however, the semantics of the tensors are more or less broken during the lowering process, thus losing potential opportunities for optimization.
SUMMARY OF THE INVENTION
The purpose of the present disclosure is to solve the problem that, in the prior art, tensor data needs to be split into scalar data, and to provide a method and device capable of retaining tensor primitives throughout the entire processing process.
According to a first aspect of the present disclosure, there is provided a method for processing multidimensional data, comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
According to a second aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
In the present disclosure, the proposed Tensor Intact Compiling (TIC) architecture can improve performance, efficiency, and portability.
Description of the Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, in which:
FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure;
FIG. 2a shows a flowchart of a method according to an embodiment of the present disclosure;
FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in FIG. 2a;
FIG. 3 shows a schematic diagram of a GIR according to an embodiment of the present disclosure;
FIG. 4a shows a device for processing multidimensional data according to an embodiment of the present disclosure; FIG. 4b shows a flowchart of the steps performed by the first processing device according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to another embodiment of the present disclosure;
FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors;
FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs and the like to form a multidimensional data processor with common features;
FIG. 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure;
FIG. 7c shows the structure of a multi-layer processor and its functions;
FIG. 8a shows an exemplary code diagram of the second intermediate representation TIR;
FIG. 8b shows a schematic diagram of converting the second intermediate representation into the first intermediate representation according to an embodiment of the present disclosure;
FIGS. 9a and 9b show schematic diagrams of a traditional neural network operation and of the neural network after operator fusion;
FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure;
FIG. 11 shows the manner in which data is stored when there are multiple PSMs according to an embodiment of the present disclosure;
FIGS. 12a to 12d show schematic diagrams of the rotation of data when multiple PSMs store data;
FIGS. 13a and 13b depict schematic diagrams of mapping parallel tasks to parallel processing clusters in the virtual processor;
FIG. 14 shows the performance of the TIC technique and other baseline techniques on GPU-TC;
FIG. 15 shows the performance of the TIC technique and other baseline techniques on MLU;
FIG. 16 shows the performance of the TIC technique and other baseline techniques on TPU;
FIG. 17 compares the reduction in LoC when using the TVM and TIC techniques to implement convolution operations on GPU-TC and MLU;
FIG. 18 shows a combined processing device; and
FIG. 19 shows an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third" and "fourth" in the claims, description and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "comprising" and "including" as used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly dictates otherwise. It should be further understood that, as used in the description and claims of the present disclosure, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and in the claims, the term "if" may be contextually interpreted as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
A variety of programming languages have been developed for traditional general-purpose computing platforms, including low-level assembly languages for specific hardware architectures (such as x86, ARM, and RISC-V assembly languages), high-level languages that are convenient for users to program in, and logic programming languages oriented towards logical reasoning, such as Prolog. These programming languages face many problems on the intelligent computing systems represented by deep learning processors. There are three gaps between traditional programming languages and intelligent computing systems: first, the semantic gap, in which traditional programming languages have difficulty describing high-level intelligent computing semantics efficiently, resulting in low development efficiency of intelligent applications; second, the hardware gap, in which traditional programming languages have difficulty abstracting the hardware features of intelligent computers efficiently, resulting in low execution efficiency of the finally generated code; and third, the platform gap, in which the types of intelligent computing hardware platforms are numerous and constantly growing, and programs optimized for a specific platform are difficult to port across platforms.
FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure. The method of the embodiments of the present application may be applied to a processor on which a compiler, a compiling component, or a compiling program may run. The compiler, compiling component, or compiling program may be used to perform at least one step of the method.
As shown in FIG. 1, in order to reduce or eliminate at least one of the above gaps, the method may include: in operation S110, receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and in operation S130, parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
As mentioned above, the multidimensional data of the present disclosure may include non-scalar data such as vector data, matrix data, and tensor data, and may also be any other data of higher dimensions. The technical solutions of the present disclosure can also process scalar data, which will be described later. It should be understood, however, that the following description mainly takes tensor data as an example.
In the above, the multidimensional data keeps its multidimensional semantics all the way from reception to parsing, instead of being split into loops over multiple scalar data as in the traditional technology. Therefore, in the solution of the present disclosure, there is no need to first split the multidimensional data into scalar data and then recombine the scalar data into multidimensional data, which reduces intermediate conversions and improves computing efficiency. In addition, since the multidimensional data is always maintained, it is more intuitive for the user, which also makes it convenient for the user to edit the data and the above architecture. Furthermore, the elimination of the semantic gap further improves the efficiency of programming and computation.
Preserving the semantics of multidimensional data can be achieved by a corresponding programming language, for example a programming language with Conv semantics and a Tensor type. Generally, compared with traditional languages such as C++ and Python, a Tensor-type programming language can greatly reduce the amount of programming and preserve the semantics of multidimensional data (such as tensor data).
The target program described above may include a high-level language supported by the neural network hardware, such as the CUDA C and BANG C languages mentioned herein. These target programs can also process multidimensional data and preserve multidimensional data semantics.
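To make the flow of operations S110 and S130 concrete, the following Python fragment is only an illustrative sketch under stated assumptions: the TensorOp node type, its fields, and the emitter function are hypothetical and are not the actual first intermediate representation or code generator of the present disclosure. The point it shows is that each operation stays a whole-tensor node and is emitted as a single tensor-typed statement, never unrolled into scalar loops.

```python
# Minimal sketch (assumed node fields, not the patent's actual implementation):
# a tensor-level IR node is parsed and emitted directly as a tensor-typed target
# statement, so the multidimensional ("tensor") semantics survive into the output.
from dataclasses import dataclass

@dataclass
class TensorOp:              # hypothetical first-intermediate-representation node
    name: str                # e.g. "conv"
    inputs: list             # operand names, e.g. ["data", "kernel"]
    output: str              # result name
    attrs: dict              # operation info, data attributes, dimension order

def emit_target_program(ops):
    """Parse tensor-level IR nodes and emit tensor-typed target code as text."""
    lines = []
    for op in ops:
        args = ", ".join(op.inputs)
        # Each op is emitted as a whole-tensor call, never lowered to scalar loops.
        lines.append(f"{op.output} = {op.name}({args})  # attrs={op.attrs}")
    return "\n".join(lines)

ir = [TensorOp("conv", ["data", "kernel"], "tmp",
               {"layout": "NCHW", "dtype": "float16"}),
      TensorOp("add", ["tmp", "bias"], "out", {"dtype": "float16"})]
print(emit_target_program(ir))
```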
According to an embodiment of the present disclosure, the first intermediate representation includes one or more of the operation information, data attributes, and dimension order of the multidimensional data.
In the above, the operation information may refer to various kinds of information for operating on the data, such as the structure of the neural network, the operators in the neural network, data access operations, data optimization, and so on, which describe all operations related to processing and computing the data. The operation information of the multidimensional data here still contains the semantics of the multidimensional data. The structure of the neural network may refer to the relationships between each operator and the other operators in the neural network, the input and output relationships between operators, and so on, which describe the overall structure of the neural network. An operator in the neural network may be any information describing the operator, such as the type of the operator, or whether the operator is a single operator or a combination of multiple single operators.
The data attributes may describe the type of the data, such as a float type, a fix type, and so on. It should be understood that the above types are merely examples rather than limitations of the present disclosure.
The dimension order may be NHWC or NCHW. For images, N indicates how many images there are in the batch, H indicates how many pixels an image has in the vertical direction, W indicates the number of pixels in the horizontal direction, and C indicates the number of channels (for example, a black-and-white image has C=1 channel, while an RGB color image has C=3 channels).
NHWC has better memory-access locality (an output pixel can be obtained for every three input pixels), whereas NCHW must wait until the inputs of all channels are ready to obtain the final output, which requires a large temporary space.
In the present disclosure, a suitable format can be determined according to actual needs, such as the processing capability of the accelerator and the compatible dimension order.
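The practical difference between the two dimension orders can be illustrated with the flat-memory offset of a single element. The following sketch is only an illustration (the helper functions and the example shape are assumptions); it shows why NHWC keeps all channels of one pixel adjacent while NCHW does not.

```python
# Illustrative sketch only: flat-memory offsets of element (n, c, h, w) under the
# two dimension orders discussed above.
def offset_nchw(n, c, h, w, C, H, W):
    # innermost dimension is W, then H, then C
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # innermost dimension is C, so all channels of one pixel are adjacent
    return ((n * H + h) * W + w) * C + c

# Example: a tensor with (N, C, H, W) = (1, 3, 2, 2)
print(offset_nchw(0, 1, 0, 0, C=3, H=2, W=2))  # 4: channels of a pixel are far apart
print(offset_nhwc(0, 1, 0, 0, C=3, H=2, W=2))  # 1: channels of a pixel are adjacent
```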
FIG. 2a shows a flowchart of a method according to an embodiment of the present disclosure.
As shown in FIG. 2a, converting the first intermediate representation into a target program S130 includes: in operation S1310, converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and, in operation S1330, converting the abstract language representation into the target program.
FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in FIG. 2a. For ease of understanding and reading, the architecture of the present disclosure is called the Tensor Intact Compiling (TIC) architecture.
As shown in FIG. 2b, the architecture may include: machine learning applications A0, A1, …, AM-1; a framework; a graph intermediate representation GIR (Graph Intermediate Representation); a tensor intermediate representation module TIR (Tensor Intermediate Representation, the first intermediate representation described above); a tensor-aware language module TAL (Tensor Aware Language, the abstract language representation described above); a tensor abstract machine module TAM (Tensor Abstract Machine); back-end high-level languages, such as CUDA C, BANG C, XLA-TPU, and so on, where the above-mentioned target program may refer to such a back-end high-level language; and machine learning hardware H0, H1, …, HM-1.
The TIR is an intermediate representation designed to meet the needs of machine learning and capable of expressing scalar, vector, matrix, and tensor operations. Therefore, in addition to conventional scalar operations (such as arithmetic operations, logical operations, comparison operations, memory operations, function calls, and conditional operations), the TIR can also provide descriptions of vectors, matrices, and tensors, thereby preserving the semantics of these data.
According to an embodiment of the present disclosure, based on the architecture shown in FIG. 2b, the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graphically expressed intermediate representation.
The traditional intermediate representation may include a graph intermediate representation (Graph Intermediate Representation, GIR). For example, the graphically expressed intermediate representation may be the computation-graph intermediate representation obtained after parsing by the deep learning framework TensorFlow, or the computation-graph intermediate representation of the neural network compilation framework TVM, such as NNVM or Relay; these are only examples and are not intended to limit the scope of the present application. In the process of generating a target program from a traditional intermediate representation IR (such as a GIR), the original IR is usually split into intermediate representations of multiple scalar-computation loops, and the target program is then generated from those intermediate representations. It can be seen that, in the traditional method, the IR in tensor form first needs to be converted into scalar form, and the scalar form then needs to be converted into a target program that supports tensor semantics. This way of converting back and forth between tensors and scalars is verbose and error-prone. With the present invention, there is no need to convert tensor data into scalar data; instead, the semantics of tensors can be preserved. In addition, a large tensor operation can, for example, be converted into smaller tensor operations; compared with splitting tensor operations into scalar operations in a traditional IR, the TIR is more intuitive, which improves development efficiency for users. In the technical solution of the present disclosure, the semantics of tensors are always maintained and no conversion between tensors and scalars is needed, which improves the efficiency of code compilation and conversion.
FIG. 3 shows a schematic diagram of a GIR according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the second intermediate representation can be obtained by: parsing a neural network model file, where the neural network model file includes the operation nodes and topological connection relationships of the neural network; and obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
The neural network model file described here may be a Json file, which records the structure, operators, and other information of the neural network; the details of the neural network can be obtained from the Json file.
Further, in FIG. 3, x is the input data, which undergoes a convolution operation with the weight data; the intermediate data generated by the convolution operation is added to the data y (the bias value), and the calculation result is finally obtained, where y is the bias data in the convolution operation. It should be understood that FIG. 3 is a simplified representation of a neural network, not a limitation on neural networks.
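As an illustration of this parsing step, the following sketch builds a small graph intermediate representation from a model file. The Json field names used here ("nodes", "op", "inputs") are assumptions for illustration only and do not correspond to the actual file format of any particular framework.

```python
# Illustrative sketch only: parsing operation nodes and their topological
# connections out of a Json model file into a (nodes, edges) graph.
import json

model_json = """
{ "nodes": [
    {"name": "conv1", "op": "conv", "inputs": ["x", "weight"]},
    {"name": "out",   "op": "add",  "inputs": ["conv1", "y"]}
]}
"""

def build_gir(text):
    """Return operator nodes and the topological connections between them."""
    parsed = json.loads(text)["nodes"]
    nodes = {n["name"]: n["op"] for n in parsed}
    edges = [(src, n["name"]) for n in parsed for src in n["inputs"]]
    return nodes, edges

nodes, edges = build_gir(model_json)
print(nodes)   # {'conv1': 'conv', 'out': 'add'}
print(edges)   # [('x', 'conv1'), ('weight', 'conv1'), ('conv1', 'out'), ('y', 'out')]
```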
The difference between the GIR and the TIR of the present disclosure is described below at the code level. The code in FIG. 4a and FIG. 4b shows the difference between the traditional intermediate representation and the TIR representation of the present disclosure.
As shown in FIG. 4a, a piece of tensor data is divided into multiple scalar data during computation and is expressed, for example, with for loops. In essence, in the prior art, tensor computation is divided into scalar computation, so that the semantics of the tensor data are lost. This places a heavy programming burden on the user and is computationally inefficient.
FIG. 4b shows an example of a TIR representation according to an embodiment of the present disclosure.
In FIG. 4b, the TIR representation includes a dimension order, which may be NCHW, and specifies the size of N "batch_size", the value of C "output_channel", the height "height", and the width "width". In addition, the first intermediate representation also includes the data "data" and the kernel "kernel", as well as their convolution operation information and the data type (such as float16).
It can be seen that, in the technical solution of the present disclosure, the semantics of the tensor data are preserved, and the shape and operation information of the tensor data are described, which simplifies the operations and improves programming efficiency.
It should be understood that the scheme of including tensor data semantics in FIG. 4b is applicable not only to tensor data but also to vector data and matrix data, where vector data is one-dimensional data and matrix data is two-dimensional data. The programming manner for preserving the semantics of multidimensional data is also not limited to the scheme shown in FIG. 4b; any other manner capable of preserving the semantics of multidimensional data is included within the scope of the present disclosure.
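The contrast between FIG. 4a and FIG. 4b can be sketched as follows. The function names and shapes are assumptions and the snippet is not the actual code of the figures; it simply shows how the same element-wise addition looks when lowered to scalar loops versus kept as one whole-tensor operation.

```python
# Illustrative contrast only: scalarized loops (FIG. 4a style) versus a single
# whole-tensor operation that preserves tensor semantics (FIG. 4b style).
import numpy as np

def add_scalarized(a, b):
    # FIG. 4a style: the tensor semantics disappear into nested scalar loops
    n, c, h, w = a.shape
    out = np.empty_like(a)
    for i in range(n):
        for j in range(c):
            for k in range(h):
                for l in range(w):
                    out[i, j, k, l] = a[i, j, k, l] + b[i, j, k, l]
    return out

def add_tensor(a, b):
    # FIG. 4b style: one tensor-level statement; shape and dtype stay visible
    return a + b

x = np.ones((1, 3, 4, 4), dtype=np.float16)  # NCHW: batch_size, output_channel, height, width
y = np.ones_like(x)
assert np.array_equal(add_scalarized(x, y), add_tensor(x, y))
```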
The TAL is an abstract language representation, which may be built as an extension based on the C language and takes into account the features of lower-level hardware (such as the on-chip memory hierarchy, control logic, and computing units). These hardware features will be described later when the TAM is introduced. The goal of the TAL is to provide users with alternatives so that modifications can be made according to the user's needs. The TAL can be converted into code for a target platform; such a target platform is characterized by having native tensor instructions, such as wmma for GPU-TC. The process of converting the TAL into code for the target platform may include: first converting the TAL into a target program, and then compiling the target program into machine instructions that the target platform can run.
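A minimal sketch of this lowering step is given below. The mapping table and the MLU intrinsic name are assumptions for illustration (only wmma is named in the text above), so this is not the actual lowering logic of the present disclosure.

```python
# Illustrative sketch only: a tensor-level operation is mapped to a native tensor
# instruction of the chosen target platform instead of being expanded into scalar code.
NATIVE_TENSOR_INSTRS = {
    ("matmul", "GPU-TC"): "wmma",        # tensor-core matrix multiply-accumulate
    ("conv",   "MLU"):    "bang_conv",   # hypothetical identifier for illustration
}

def lower(op_name, platform):
    intr = NATIVE_TENSOR_INSTRS.get((op_name, platform))
    if intr is None:
        raise NotImplementedError(f"no native tensor instruction for {op_name} on {platform}")
    return f"{intr}(...)  // emitted by the TAL-to-target step"

print(lower("matmul", "GPU-TC"))
```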
The TAM is a programming model provided to software-programming users and contains a basic abstraction of hardware accelerators. The TAM in the embodiments of the present application may abstract the common features of various neural network accelerators, extracting the key and common features of tensor processing in various machine learning architectures. Based on the TAM, hardware-aware optimizations can be performed at higher layers, and these features can even be exposed to the user. Since the TAM can be instantiated into different concrete platforms (such as GPU-TC), the portability of the system can be significantly improved.
In the present disclosure, the TIR, the TAL, and the TAM supporting the TAL can all process multidimensional data and can all recognize multidimensional semantics. Therefore, in the above conversion process, there is no need to convert multidimensional data such as tensor data into scalar data; instead, the work can be carried out directly in the context of multidimensional data, thereby reducing or eliminating the need to convert multidimensional data into scalar data and/or scalar data into multidimensional data.
It should be understood that the TAM is an abstraction of the common features of various neural network accelerators, so it can also be regarded as a specific neural network accelerator. Just as an ordinary neural network accelerator can run corresponding machine instructions, this specific neural network accelerator can run a corresponding target program. It should be understood that the target program here is a general term, which may be a user-editable high-level language. In the present disclosure, the TAM can correspond to specific neural network accelerator hardware, and the TAL can correspond to the above-mentioned target program.
Above the graph intermediate representation (GIR) in FIG. 2b there may be a framework; examples of frameworks include Caffe (a convolutional neural network framework), TensorFlow, MXNet, PyTorch, PaddlePaddle, and so on. Above the framework there may be various machine learning applications.
In the above structure, the TAM can support the running of the TAL, just as a hardware neural network accelerator supports a specific language; for example, GPU-TC supports CUDA C, MLU supports BANG C, and so on.
In the structure shown in FIG. 2b, below the TAL and the TAM are the specific hardware neural network accelerators and the target programs they support. In the example structure shown in FIG. 2b, the machine learning hardware may exemplarily include H0, H1, …, HM-1, and the languages supported by the machine learning hardware may include CUDA C, BANG C, TPU, and so on; the supported language depends on the specific hardware structure. In the structure shown in FIG. 2b, the abstract language representation TAL can easily run on the various architectures of the TAM.
The TIC architecture of one embodiment of the present disclosure has been described above in conjunction with FIGS. 2a to 4b. A TIC architecture according to another embodiment of the present disclosure is described below.
According to an embodiment of the present disclosure, converting the first intermediate representation into a target program S130 may include: receiving an abstract language representation, and compiling the abstract language representation into the first intermediate representation; wherein the abstract language representation is editable by a user.
FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to the above method.
Similar to the architecture of FIG. 2b, the architecture shown in FIG. 5 may include: machine learning applications A0, A1, …, AM-1; a framework; a graph intermediate representation GIR; a tensor intermediate representation module TIR; a tensor-aware language module TAL; a tensor abstract machine module TAM; back-end high-level languages such as CUDA C, BANG C, XLA-TPU, and so on; and machine learning hardware H0, H1, …, HM-1.
According to an embodiment of the present disclosure, the abstract language representation may be formed by receiving a second intermediate representation, the second intermediate representation including a graphically expressed intermediate representation.
In FIG. 5, the multidimensional data from the GIR can be received to form the TAL, and the abstract language representation TAL is then converted into the first intermediate representation TIR, which differs from FIG. 2b, where the GIR is first converted into the TIR and then into the TAL. In the embodiment shown in FIG. 5, the information in the TIR can first be edited in the TAL to form a representation suitable for subsequent processing. For example, operators passed down from the GIR can be modified in the TAL.
According to a variation of the present disclosure, in FIG. 5, the user can directly edit new operators in the TAL without receiving existing operators from the GIR. The user can create new operators that are not in the framework according to his or her own needs. This makes the architecture provided by the present disclosure more flexible and adaptable.
According to yet another variation of the present disclosure, the TAL in FIG. 5 can also receive the multidimensional data in the framework, such as the operators in the framework, directly from the framework, without receiving operators from the GIR.
The TAL in FIG. 5 may also be built as an extension of the C language; it likewise takes into account the features of lower-level hardware (such as the on-chip memory hierarchy, control logic, and computing units). The goal of the TAL is to provide users with alternatives so that modifications can be made according to the user's needs, for example the user defining new operators or modifying existing operators to perform the desired optimization.
It should be understood that, similar to FIG. 2b, as shown in FIG. 5, the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graphically expressed intermediate representation.
In the embodiment of FIG. 5, for the graph intermediate representation, one part of the GIR may be converted into the TAL and then edited by the user, and another part may be converted into the TIR of the present disclosure. The difference between the GIR and the TIR has been described above in conjunction with FIGS. 4a and 4b and will not be repeated here. Alternatively, the graph intermediate representation GIR shown in FIG. 5 may be entirely converted into the TIR, and operators customized by the user in the TAL may also be converted into the TIR. It should be clear that the connection line between the GIR and the TAL in FIG. 5 only schematically expresses one possible implementation; this connection does not necessarily exist.
In the architectures of FIG. 2b and FIG. 5, the machine learning hardware may be various kinds of hardware, such as a GPU, an MLU, and so on. Each kind of hardware has its own programming language, such as CUDA C, BANG C, and so on, and these high-level programming languages designed for the hardware can serve as the back end. The TAL may be formed based on programming languages designed for specific hardware, such as CUDA C and BANG C, so as to facilitate editing by the user. This helps to maximize the performance of the hardware.
The structures of several neural network accelerators are introduced below to facilitate a more detailed description of the TAM later. FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors.
FIG. 6a is a schematic structural diagram of the neural network accelerator Cambricon-ACC, which includes an I/O interface circuit, a controller, a vector SPM memory, a matrix SPM memory, a vector functional unit VFU, and a matrix functional unit MFU; in addition, in order to process scalar data, it also includes a scalar functional unit SFU. In the accelerator shown in FIG. 6a, the I/O interface is connected to the controller, the SFU, the vector SPM, and the matrix SPM, while the vector SPM is connected to the VFU and the matrix SPM is connected to the MFU. The vector SPM is used to store vector data and the matrix SPM is used to store matrix data; the VFU accesses vector data in the vector SPM, and the MFU accesses matrix data in the matrix SPM.
FIG. 6b is a schematic diagram of a multi-layer structure of FIG. 6a.
As shown in FIG. 6b, the neural network accelerator includes an I/O interface circuit, a controller, a cluster memory, and a plurality of parallel operation components. Each operation component includes a plurality of processing units P0-Pn, and each processing unit includes the vector SPM memory, matrix SPM memory, vector functional unit VFU, and matrix functional unit MFU shown in FIG. 6a. The plurality of parallel operation components are all connected to the cluster memory, and the cluster memory is connected to the controller. The operation components, the cluster memory, and the controller perform data access through the I/O interface circuit.
FIG. 6c shows the structure of a tensor processing unit TPU. In FIG. 6c, the TPU includes an I/O interface circuit, a controller, a unified buffer, a weight first-in-first-out memory (Weight FIFO), and an operation component. The I/O interface circuit is connected to the controller, the unified buffer, and the weight FIFO memory. The operation component includes an MMU, an activation component, a normalization/pooling component, and the like, and is connected to the unified buffer and the weight FIFO memory so as to access these memories.
FIG. 6d shows the structure of a GPU-TC. In FIG. 6d, the GPU includes an I/O interface circuit, a controller, and a plurality of operation components. The I/O interface circuit is connected to the controller and to each operation component, and the controller is connected to each operation component. Each operation component includes a shared memory and a plurality of tensor processing cores G0-Gn connected to the shared memory.
The structure of the virtual processor TAM obtained by abstracting the above accelerators/processors is described below.
图7a示出了根据本公开的一个实施方式的虚拟处理器的结构示意图,其对多种神经网络加速器/TPU/GPU等等进行了抽象,从而形成具有公共特征的多维数据处理器。7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., to form a multi-dimensional data processor with common features.
如图7a所示,虚拟处理器320可以包括I/O接口电路3210、控制电路3220和运算组件3230,所述运算组件可以包括第一存储电路3231和运算电路3233;所述I/O接口电路3210可以配置用于所述虚拟处理器320的输入和输出;所述控制电路3220可配置为通过所述I/O接口电路3210进行存取操作;所述第一存储电路3231配置为通过所述I/O接口电路3210至少读取输入数据和权值数据;所述运算电路3233可配置为从所述第一存储电路3231中读取所述输入数据和权值数据以进行运算。在上文中,控制电路3220连接到I/O接口电路3210以及运算组件3230,以对I/O接口电路3210和运算组件3230进行控制,虚拟处理器320的输入和输出可以是权值数据、输入数据、中间数据、指令、代码等任何的输入和输出。相应的内容输入到第一存储电路3231之后,运算电路3233访问第一存储电路3231并从中读取所需的内容,并将经过运算的内容存入到该第一存储电路3231中。As shown in FIG. 7a, the virtual processor 320 may include an I/O interface circuit 3210, a control circuit 3220 and an operation component 3230, and the operation component may include a first storage circuit 3231 and an operation circuit 3233; the I/O interface circuit 3210 may be configured for input and output of the virtual processor 320; the control circuit 3220 may be configured to perform access operations through the I/O interface circuit 3210; the first storage circuit 3231 may be configured to perform an access operation through the The I/O interface circuit 3210 reads at least input data and weight data; the operation circuit 3233 may be configured to read the input data and weight data from the first storage circuit 3231 for operation. In the above, the control circuit 3220 is connected to the I/O interface circuit 3210 and the arithmetic component 3230 to control the I/O interface circuit 3210 and the arithmetic component 3230. The input and output of the virtual processor 320 can be weight data, input Any input and output of data, intermediate data, instructions, code, etc. After the corresponding content is input to the first storage circuit 3231 , the operation circuit 3233 accesses the first storage circuit 3231 and reads the required content therefrom, and stores the calculated content into the first storage circuit 3231 .
图7b示出了根据本公开的一个实施方式的虚拟处理器更进一步的示意图。Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure.
如图7b所示,所述第一存储电路3231可以包括:并行权值存储器PWM和并行神经元存储器PNM,所述PWM用于存储权值数据,所述PNM用于存储输入数据;所述运算电路3233可以包括并行功能单元PFU,用于对非标量数据进行运算;所述I/O接口电路3210可以与所述控制电路3220、PWM以及所述PNM连接;所述控制电路3220与所述PWM、所述PNM以及所述PFU连接;所述PWM和所述PNM均连接到所述PFU。As shown in FIG. 7b, the first storage circuit 3231 may include: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data; the operation The circuit 3233 may include a parallel functional unit PFU for performing operations on non-scalar data; the I/O interface circuit 3210 may be connected to the control circuit 3220, PWM and the PNM; the control circuit 3220 is connected to the PWM , the PNM and the PFU are connected; both the PWM and the PNM are connected to the PFU.
在图7b中,进行神经网络运算的输入数据和权值数据可以分别存储在PNM和PWM中,以便于运算电路3233的访问;并行功能单元PFU可以从所述PNM和PWM中提取所需的数据,并可以处理向量数据、矩阵数据以及标量数据,也可以处理更高维度的数据。In Fig. 7b, the input data and weight data for the neural network operation can be stored in PNM and PWM respectively, so as to facilitate the access of the operation circuit 3233; the parallel functional unit PFU can extract the required data from the PNM and PWM , and can handle vector data, matrix data, and scalar data, as well as higher-dimensional data.
更进一步地,如图7b所示,为了处理标量数据,本公开的虚拟处理器进一步包括标量功能单元SFU,所述SFU连接到所述控制电路和所述I/O接口电路,配置为对标量数据进行运算。Further, as shown in FIG. 7b, in order to process scalar data, the virtual processor of the present disclosure further includes a scalar functional unit SFU, the SFU is connected to the control circuit and the I/O interface circuit, and is configured to process scalar data. data to operate.
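To make the cooperation between these components more concrete, a minimal C++ sketch of the single-core TAM abstraction is given below. The class name Tam, the buffer sizes, and the elementwise multiply-accumulate used in place of a real PFU instruction are assumptions introduced here purely for illustration and are not part of the disclosed hardware.
#include <cstddef>
#include <vector>

// Minimal model of the single-core TAM: two on-chip buffers (PNM and PWM),
// a parallel functional unit (PFU) for non-scalar work, and a scalar functional unit (SFU).
struct Tam {
    std::vector<float> pnm;   // parallel neuron memory: input and intermediate data
    std::vector<float> pwm;   // parallel weight memory: weight data
    Tam(std::size_t pnm_words, std::size_t pwm_words) : pnm(pnm_words), pwm(pwm_words) {}

    // PFU: operates on non-scalar data held in the PNM and the PWM (elementwise example).
    void pfu_multiply_accumulate(std::size_t n, float* out) {
        for (std::size_t i = 0; i < n; ++i) out[i] += pnm[i] * pwm[i];
    }

    // SFU: operates on scalar data (for example, adding a single bias term).
    float sfu_add(float a, float b) { return a + b; }
};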
In the above, FIG. 7b describes the structure and functions of the single-layer processor; the structure and functions of the multi-layer processor are described below with reference to FIG. 7c.
As shown in FIG. 7c, the virtual processor further includes a shared storage circuit PSM, and there are multiple operation components. The PSM is configured to read input data and weight data through the I/O interface circuit; the multiple operation components are connected to the PSM in parallel and are configured to read the input data and weight data from the shared storage circuit to perform operations.
In FIG. 7c, the shared storage circuit PSM is connected to multiple operation components. The input data, weight data, and other data required by the operation components are first stored in the PSM, and the operation components then fetch the data from the PSM. The PSM is visible to the user, and the user can manage it explicitly.
The single-layer TAM structure shown in FIG. 7b can be mapped to the single-layer structures shown in FIG. 6a and FIG. 6c; the two-layer TAM structure shown in FIG. 7c can be mapped to the two-layer structures shown in FIG. 6b and FIG. 6d. It should be understood that the structures shown in FIG. 7a to FIG. 7c are only examples; the structure of any other neural network accelerator can also be abstracted in this way. Furthermore, the TAM does not remain fixed; it can change according to the hardware structure to be abstracted.
It should be understood that, although single-layer and two-layer processor structures are described above, those skilled in the art can abstract processors with more layers.
The following takes the convolution (conv) operation (Conv = input data * weight data) as an example to illustrate the order in which data flows through the TAM. In the single-layer processor structure, the operation proceeds as follows:
A1) First, the control circuit reads program instructions from the external storage DRAM;
B1) Next, input data is read from the external storage DRAM into the PNM, and weight data is read into the PWM;
C1) The PFU reads the input data and weight data from the internal memories PNM and PWM, completes one vector operation, and writes the intermediate result back to the PNM;
D1) The intermediate result of this operation on the PNM is written back to the external storage DRAM.
Throughout the operation, the input neuron data flows along the path DRAM->PNM->PFU->PNM->DRAM.
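As a purely illustrative sketch of this single-layer flow, the load/compute/store sequence could be written in C++ as follows. The function name, the use of plain vectors for the DRAM/PNM/PWM buffers, and the elementwise multiplication standing in for the convolution are assumptions made for the example, and the input and weight buffers are assumed to have equal size.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Single-layer flow for one tile: DRAM -> PNM -> PFU -> PNM -> DRAM.
void run_single_layer_tile(const Buf& dram_in, const Buf& dram_w, Buf& dram_out) {
    Buf pnm = dram_in;                               // B1) load input data into the PNM
    Buf pwm = dram_w;                                // B1) load weight data into the PWM
    for (std::size_t i = 0; i < pnm.size(); ++i)     // C1) one PFU vector operation,
        pnm[i] = pnm[i] * pwm[i];                    //     intermediate result kept in the PNM
    dram_out = pnm;                                  // D1) write the result back to the DRAM
}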
In the two-layer structure, the operation proceeds as follows (an illustrative sketch is given after the list):
A2) The control circuit reads program instructions from the external storage DRAM;
B2) Input data and weight data are read from the external storage DRAM into the PSM;
C2) According to the task division among the multiple cores, the data in the PSM is read into the PNM and the PWM;
D2) The PFU reads the input data and weight data from the internal memories PNM and PWM, completes one vector operation, and writes the intermediate result back to the PNM;
E2) The intermediate result of this operation on the PNM is written back to the shared storage PSM;
F2) After the operation results of all the processing cores have been written back to the PSM, the results of all the operations are written back to the external storage DRAM.
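A corresponding C++ sketch of the two-layer flow is given below; the number of cores, the even slicing of the data among the cores (the data size is assumed to be divisible by the number of cores), and the elementwise multiplication standing in for the convolution are again illustrative assumptions.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Two-layer flow: DRAM -> PSM -> (PNM/PWM per core) -> PFU -> PNM -> PSM -> DRAM.
void run_two_layer(const Buf& dram_in, const Buf& dram_w, Buf& dram_out, std::size_t num_cores) {
    Buf psm_in = dram_in, psm_w = dram_w;            // B2) DRAM -> PSM
    Buf psm_out(psm_in.size(), 0.0f);
    std::size_t slice = psm_in.size() / num_cores;   // C2) task division among the cores
    for (std::size_t core = 0; core < num_cores; ++core) {
        std::size_t base = core * slice;
        Buf pnm(psm_in.begin() + base, psm_in.begin() + base + slice);  // C2) PSM -> PNM
        Buf pwm(psm_w.begin() + base, psm_w.begin() + base + slice);    // C2) PSM -> PWM
        for (std::size_t i = 0; i < slice; ++i)      // D2) one PFU vector operation per core
            pnm[i] *= pwm[i];
        for (std::size_t i = 0; i < slice; ++i)      // E2) PNM -> PSM
            psm_out[base + i] = pnm[i];
    }
    dram_out = psm_out;                              // F2) PSM -> DRAM after all cores finish
}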
The difference between GIR and TIR was introduced above. The process of converting Graph IR into TIR is described below in combination with the hardware features.
In general, converting Graph IR into TIR requires data splitting and data scheduling, for example splitting the data so that it can be stored in the corresponding memories. In the embodiments of the present application, the process of converting Graph IR into TIR may include: splitting the multidimensional data of the second intermediate representation into at least one piece of sub-multidimensional data; determining the storage space of each piece of sub-multidimensional data according to its data category (for example, the data categories include input data and weights) and according to the operations in which it needs to participate; and generating the space allocation instructions, memory access instructions, operation instructions, and so on related to the sub-multidimensional data. The space allocation instructions, memory access instructions, and operation instructions related to the sub-multidimensional data are instructions defined by the first intermediate representation.
FIG. 8a shows an exemplary code diagram of the first intermediate representation TIR.
As shown in FIG. 8a, it specifies the data formats of the input data x, the weight data Weight, and the temporary data Temp:
Produce(){
Tensor(fp32)<NCHW>(1,3,224,224)x: the data type of x is float32, the format is NCHW, and the size is (1,3,224,224)
Tensor(fp32)<NCHW>(64,3,3,3)y: the data type of y is float32, the format is NCHW, and the size is (64,3,3,3)
Tensor(fp32)<NCHW>(1,64,224,224)Temp: the data type of Temp is float32, the format is NCHW, and the size is (1,64,224,224).
The TIR also gives space allocation instructions describing how storage space is allocated for the input data x, the weight data Weight, the temporary data Temp, the bias data y, the result data Result, and so on, for example:
allocate.pnm x: store the data x in the pnm memory
allocate.pwm Weight: store the data Weight in the pwm memory
allocate.pnm Temp: store the data Temp in the pnm memory
allocate.pnm Y: store the data Y in the pnm memory
allocate.pnm Result: store the data Result in the pnm memory
In addition, the TIR gives memory access instructions that load data from off-chip memory (for example GDRAM) into on-chip memory, such as the instruction load x.gdram to x.pnm; the TIR also gives operation instructions that describe the operations, such as the instruction
Conv(x.pnm, Weight.pwm, Temp.pnm).
FIG. 8b shows a schematic diagram of converting the second intermediate representation into the first intermediate representation according to an embodiment of the present disclosure.
As shown in FIG. 8b, converting the second intermediate representation into the first intermediate representation may include: in operation S810, splitting the multidimensional data of the second intermediate representation into multiple pieces of sub-multidimensional data according to the size of the storage building block BBM (Building Block of Memory), and generating space allocation instructions and memory access instructions, so as to indicate that corresponding storage space is requested in the shared storage circuit PSM according to the space allocation instructions and that the multiple pieces of sub-multidimensional data are loaded from the off-chip memory into the shared storage circuit PSM in multiple passes according to the memory access instructions; in operation S820, splitting the sub-multidimensional data in the PSM according to the computation building block BBC (Building Block of Computation), and generating space allocation instructions (for example allocate.pnm x; allocate.pwm Weight in FIG. 8a) and memory access instructions (for example load x.gdram to x.pnm; load Weight.gdram to Weight.pwm in FIG. 8a), so as to indicate that corresponding storage space is requested in the corresponding memories (the parallel neuron memory PNM and/or the parallel weight memory PWM) according to the space allocation instructions, and that, according to the corresponding memory access instructions, the input data in the split sub-multidimensional data is loaded into the parallel neuron memory PNM (allocate.pnm x) and the weight data is loaded into the parallel weight memory PWM (allocate.pwm Weight); in operation S830, obtaining an intermediate result after operating on the input data and the weight data according to the operation instruction (for example Conv(x.pnm, Weight.pwm, Temp.pnm) in FIG. 8a), and storing the intermediate result in the PSM; and in operation S840, storing the intermediate result in the off-chip memory according to the memory access instruction (for example store Result.pnm to Result.gdram in FIG. 8a).
An exemplary description of converting the second intermediate representation into the first intermediate representation when a PSM is present has been given above. In a single-level structure there is no PSM; in that case, the input data is loaded directly into the parallel neuron memory PNM, and the weight data is loaded directly into the parallel weight memory PWM.
FIG. 8b shows the flowchart of converting GIR into TIR. In the above operations, two building blocks can be defined, namely the storage building block and the computation building block. The storage building block BBM can be the smallest granularity of a data copy, and its size is determined by the shared storage circuit PSM; the computation building block can be the smallest granularity of an operation, and its size is constrained both by the PNM and PWM and by the computation instruction of the current operation. First, the compiler can split from the off-chip memory (for example DRAM) down to the PSM dimension according to the original data and the size of the storage building block: for the original input data tensor A on the DRAM, only one BBM is loaded onto the PSM at a time, and the number of loads is tensor A / BBM.
After that, the compiler can perform the PSM→PWM and PSM→PNM splitting according to the storage building block BBM and the computation building block. After the data has been loaded onto the PNM and the PWM block by block, the operation can be completed through the operation instructions defined by the TIR. Next, the result of the PFU computation is stored into the PSM according to the current split position, and then the PSM contents are stored into the DRAM according to the current split position.
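The two-level tiling described above can be sketched as a pair of nested loops. In the C++ sketch below, the 1-D addressing, the block sizes bbm and bbc, and the elementwise multiplication standing in for the TIR operation instruction are assumptions introduced only for illustration.
#include <algorithm>
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Two-level split: DRAM -> PSM in BBM-sized pieces, then PSM -> PNM/PWM in BBC-sized pieces.
void tiled_op(const Buf& dram_in, const Buf& dram_w, Buf& dram_out,
              std::size_t bbm, std::size_t bbc) {
    dram_out.assign(dram_in.size(), 0.0f);
    for (std::size_t m = 0; m < dram_in.size(); m += bbm) {            // S810: DRAM -> PSM
        std::size_t m_end = std::min(m + bbm, dram_in.size());
        Buf psm_in(dram_in.begin() + m, dram_in.begin() + m_end);
        Buf psm_w(dram_w.begin() + m, dram_w.begin() + m_end);
        Buf psm_out(psm_in.size(), 0.0f);
        for (std::size_t c = 0; c < psm_in.size(); c += bbc) {         // S820: PSM -> PNM/PWM
            std::size_t c_end = std::min(c + bbc, psm_in.size());
            for (std::size_t i = c; i < c_end; ++i)                    // S830: PFU operation,
                psm_out[i] = psm_in[i] * psm_w[i];                     //       result kept in the PSM
        }
        std::copy(psm_out.begin(), psm_out.end(), dram_out.begin() + m);  // S840: PSM -> DRAM
    }
}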
It should be clear that FIG. 8a and FIG. 8b are only used to illustrate the conversion process of the intermediate representation; the instructions contained in the above intermediate representation are only a form of intermediate representation, not hardware instructions that a processor can execute. The operation process shown in the flow of FIG. 8b is only used to illustrate the role of the intermediate representation and is not intended to limit a specific operation process. For example, for a processor to implement the convolution operation shown in FIG. 8a, the intermediate representation shown in FIG. 8a needs to be converted into specific machine instructions, and the processor can implement the above operation process according to those machine instructions.
According to an embodiment of the present disclosure, the compiler may further optimize the first intermediate representation, so as to generate the target program according to the optimized first intermediate representation. Various embodiments of optimizing the first intermediate representation are described in detail below.
Another advantage of TIR is that the provided architecture enables potential optimization operations to be performed with tensor semantics. Traditional graph-based intermediate representations (for example Relay) only allow graph-level optimizations. Optimization schemes according to embodiments of the present disclosure are given below.
According to an embodiment of the present disclosure, the compiler optimizing the first intermediate representation may include: converting the first dimension order of the multidimensional data into a second dimension order, so as to adapt to the corresponding neural network accelerator.
The above optimization is mainly aimed at GPU operations. Since tensor data in TensorFlow uses the NHWC format by default, while using NCHW on the GPU is more efficient, two conversion nodes can be used during optimization, namely an NHWC-to-NCHW conversion node and an NCHW-to-NHWC conversion node; the conversions performed by consecutive NHWC-to-NCHW and NCHW-to-NHWC conversion nodes between two consecutive GPU compute nodes can cancel each other out.
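A minimal sketch of this cancellation pass is given below; representing the layout conversions as string-tagged nodes in a linear node sequence is a deliberate simplification of the actual intermediate representation.
#include <string>
#include <vector>

// Remove back-to-back inverse layout conversions (NCHW->NHWC followed by NHWC->NCHW,
// or vice versa); the string encoding of the nodes is illustrative only.
std::vector<std::string> cancel_layout_casts(const std::vector<std::string>& nodes) {
    std::vector<std::string> out;
    for (const std::string& n : nodes) {
        bool inverse_pair =
            !out.empty() &&
            ((out.back() == "nchw_to_nhwc" && n == "nhwc_to_nchw") ||
             (out.back() == "nhwc_to_nchw" && n == "nchw_to_nhwc"));
        if (inverse_pair) out.pop_back();   // the two conversions cancel each other out
        else out.push_back(n);
    }
    return out;
}
For example, applying this pass to the sequence {"conv", "nchw_to_nhwc", "nhwc_to_nchw", "conv"} would leave only the two compute nodes.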
According to an embodiment of the present disclosure, the compiler optimizing the first intermediate representation may further include: fusing a first operator and a second operator. Operator fusion, which may also be called layer fusion, fuses multiple layers of the neural network together when compiling and generating instructions, which reduces the number of accesses to the off-chip memory and thereby improves the data throughput.
For example, according to an embodiment of the present disclosure, after the first operator and the second operator are fused, the computation of the first operator and the second operator proceeds as follows: the intermediate result between the first operator and the second operator is stored in the on-chip memory, so that the second operator can read the data from the on-chip memory.
FIG. 9a and FIG. 9b are schematic diagrams of a traditional neural network operation and of the neural network after operator fusion.
As shown in FIG. 9a, the data access process of a traditional neural network operation is as follows:
1. Read the input data of the entire computation graph (that is, the input of the first operator) from the DRAM into the PNM, and read the weight data of the first operator into the PWM;
2. The PFU reads the input data and weight data from the PNM and the PWM to complete the operation, and writes the result of the first operator back to the PNM;
3. Write the result of the first operator from the PNM back to the DRAM, as the input of the second operator;
4. Read the input data of the second operator from the DRAM into the PNM, and read the weight data of the second operator into the PWM;
5. The PFU reads data from the PNM and the PWM to complete the operation, and writes the result of the second operator back to the PNM;
6. Write the result of the second operator back to the DRAM as the output of the entire computation graph.
In a traditional neural network, the intermediate result between two operators needs to be stored in the off-chip memory, and when the next operator performs its operation, this intermediate result needs to be read from the off-chip memory. As a result, every operation needs to read data from the off-chip memory, which obviously reduces the operation speed of the entire neural network, and the data access operations on the off-chip memory easily become the bottleneck of the neural network's operation speed.
In the present invention, as shown in FIG. 9b, the intermediate result can instead be stored in on-chip memory (for example SRAM), thereby improving the data access speed. According to an embodiment of the present disclosure, the first processor fusing the first operator and the second operator includes the following (an illustrative sketch is given after the list):
1. Write the first input data of the first operator into the PNM;
2. Write the first weight data of the first operator and the second weight data of the second operator into the PWM;
3. Compute a first operation result from the first input data and the first weight data, and write the first operation result into the PNM;
4. Compute a second operation result from the first operation result and the second weight data, and write the second operation result into the PNM.
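The following sketch illustrates the fused execution of the two operators; plain vectors stand in for the PNM/PWM buffers, the two operators are replaced by elementwise multiplications, and the weight vectors are assumed to have the same length as the input. The point of the sketch is only that the intermediate result never leaves on-chip memory.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Fused execution of two operators: the intermediate result stays in the PNM,
// so the second operator never reads from off-chip memory.
Buf fused_two_ops(const Buf& input, const Buf& w1, const Buf& w2) {
    Buf pnm = input;                                 // 1. input of the first operator -> PNM
    Buf pwm = w1;                                    // 2. weights of both operators -> PWM
    pwm.insert(pwm.end(), w2.begin(), w2.end());
    for (std::size_t i = 0; i < pnm.size(); ++i)     // 3. first operator, result kept in the PNM
        pnm[i] = pnm[i] * pwm[i];
    for (std::size_t i = 0; i < pnm.size(); ++i)     // 4. second operator reads the PNM directly
        pnm[i] = pnm[i] * pwm[input.size() + i];
    return pnm;                                      // only the final result goes off chip
}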
A method of operator fusion has been described above. It should be understood that the intermediate result may also be larger than the capacity of the PNM, so that not all of the intermediate results can be stored in the PNM at once.
According to the first embodiment of the present disclosure, if the first operation result of the first input data and the first weight data of the first of the two operators is larger than the capacity of the neuron memory PNM, the first weight data is split to form multiple pieces of first sub-weight data, such that the first sub-operation result of the first input data and one piece of first sub-weight data is smaller than the capacity of the PNM. The first input data is then operated on with the multiple pieces of first sub-weight data in turn, and each time a first sub-operation result is obtained, that first sub-operation result is stored in the PNM, so that the second operator can operate on it to obtain a second sub-operation result.
FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure.
As shown in FIG. 10, the input data is shown at the far left. Suppose that the intermediate result in FIG. 10 is larger than the capacity of the PNM, so that the intermediate result cannot be stored in the on-chip memory. In this case, the weight data can be split, for example into weight data 1 and weight data 2. Weight data 1 is represented by the light squares, and weight data 2 is represented by the dark squares.
Thus, the input data is first convolved with weight data 1 (the first operator), as shown by route 1 in FIG. 10. The intermediate data generated by the convolution operation (shown as the light squares in the intermediate result) is stored in the on-chip memory. Then the second operator reads the stored intermediate data from the on-chip memory and performs its operation (as shown by route 2), and the resulting output is stored in the off-chip memory.
Next, the input data is convolved with weight data 2, as shown by route 3 in FIG. 10. The intermediate data generated by the convolution operation (shown as the dark squares in the intermediate result) is stored in the on-chip memory. Then the second operator reads the stored intermediate data from the on-chip memory and performs its operation (as shown by route 4), and the resulting output is stored in the off-chip memory.
As can be seen from FIG. 10 and the above description, in this embodiment the intermediate result is not generated all at once, but is formed block by block by splitting the weight data. Each generated intermediate block can be stored in the on-chip memory, and the second operator does not need to read data from the off-chip memory, so the number of accesses to the off-chip memory is reduced and the operation efficiency is improved.
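The blocked form of this fusion can be sketched as follows; the num_blocks-way split, the elementwise stand-ins for the two operators, and the assumption that the data size is divisible by the number of blocks are all introduced only for illustration.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Fusion when the full intermediate result exceeds the PNM capacity: the weights are
// split into blocks, and each block's intermediate result is consumed by the second
// operator while it is still resident in on-chip memory.
Buf blocked_fusion(const Buf& input, const Buf& weight1, const Buf& weight2,
                   std::size_t num_blocks) {
    Buf output(input.size(), 0.0f);
    std::size_t block = input.size() / num_blocks;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::size_t base = b * block;
        Buf pnm(block);
        for (std::size_t i = 0; i < block; ++i)            // first operator on one weight block
            pnm[i] = input[base + i] * weight1[base + i];  // (routes 1 and 3 in FIG. 10)
        for (std::size_t i = 0; i < block; ++i)            // second operator reads the PNM directly
            output[base + i] = pnm[i] * weight2[base + i]; // (routes 2 and 4 in FIG. 10)
    }
    return output;
}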
FIG. 11 shows how data is stored when there are multiple PSMs according to an embodiment of the present disclosure, and FIG. 12a to FIG. 12d are schematic diagrams showing how data is rotated among multiple PSMs during data storage.
According to an embodiment of the present disclosure, when there are multiple PSMs, the first processor is further configured to: in operation S1110, store multiple sets of weight data in the multiple PSMs, rotating them multiple times; in operation S1120, operate on the weight data in the multiple PSMs after each rotation; and in operation S1130, after the weight data in all the PSMs has been fully rotated, read new weight data into the multiple PSMs.
The memories of the PFU (for example the PNM and the PWM) are relatively close to the computation units; therefore, these memories need to be used carefully to avoid stalling the execution pipeline. More specifically, the programmer needs to calculate the size of the on-chip buffer required for the computation. If the size required for one computation exceeds the size of the PFU memory, the compilation process will stop with an error message. A method for improving the utilization of the PWM is given below: by keeping the weight data in the PWM, multiple convolution operations can be performed on different parts of the input data. Compared with the traditional method, which needs to access the off-chip memory continuously, the efficiency of the method of the present disclosure is improved by a factor of 1.6.
Specifically, when a processor or a chip has multiple processing clusters, there may be multiple PSM memories, and each PSM memory can read corresponding data, for example weight data, from the off-chip memory (for example DRAM). FIG. 12a to FIG. 12d show four PSMs, namely PSM0, PSM1, PSM2, and PSM3; these PSMs can store four sets of weight data, denoted weight data A, weight data B, weight data C, and weight data D.
First, as shown in FIG. 12a, weight data A is stored in PSM0, weight data B in PSM1, weight data C in PSM2, and weight data D in PSM3. After the PSMs have stored the above four sets of weight data, these weight data can be operated on with the input data.
After the four sets of weight data have been operated on, they can be stored in the PSMs in rotation. In FIG. 12b, after one rotation, weight data A is transferred from PSM0 to PSM1, weight data B from PSM1 to PSM2, weight data C from PSM2 to PSM3, and weight data D from PSM3 to PSM0.
After the above weight data D, weight data A, weight data B, and weight data C have been operated on, the next rotation is performed.
In FIG. 12c, weight data A is transferred from PSM1 to PSM2, weight data B from PSM2 to PSM3, weight data C from PSM3 to PSM0, and weight data D from PSM0 to PSM1.
In FIG. 12d, weight data A is transferred from PSM2 to PSM3, weight data B from PSM3 to PSM0, weight data C from PSM0 to PSM1, and weight data D from PSM1 to PSM2.
During these three iterations (that is, rotations), there is no need to communicate with the off-chip memory. After three iterations, that is, three rotations, new weight data needs to be loaded; in this case, the new weight data can be read from the off-chip memory, for example the DRAM.
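A minimal sketch of this rotation schedule is given below; it assumes that each weight set fits in one PSM, uses a simple cyclic shift of the buffers, and omits the actual computation step, all of which are simplifications made for illustration.
#include <algorithm>
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Weight rotation across the PSMs: after the initial load, (num_psm - 1) rotations
// reuse the resident weight sets with no off-chip traffic.
void rotate_weights(std::vector<Buf>& psm /* one buffer per PSM */) {
    std::size_t num_psm = psm.size();
    for (std::size_t step = 0; step + 1 < num_psm; ++step) {
        // the computation on every psm[k] would run here (operation S1120)
        std::rotate(psm.begin(), psm.begin() + (num_psm - 1), psm.end());
        // the rotate moves the last buffer to the front, i.e. each weight set advances
        // to the next PSM, matching the rotation shown in FIG. 12b to FIG. 12d
    }
    // after all rotations, new weight data would be loaded from the DRAM (operation S1130)
}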
The user can implement the method shown in FIG. 11 and FIG. 12a to FIG. 12d by editing the TAL. With this intermediate access latency, the cluster memory can partially bridge the gap between accesses to the PFU memory (for example, 10 clock cycles) and accesses to the off-chip memory DRAM (for example, 300 clock cycles).
According to an embodiment of the present disclosure, processing core synchronization, processing cluster synchronization, and/or chip synchronization can be performed at the TAL; and/or parallel tasks can be mapped to parallel processing clusters in the virtual processor.
Likewise, the control logic of this embodiment can also be realized by the user editing the TAL.
The main purpose of exposing the control logic at the TAL level is to provide functional correctness and execution efficiency, the main features of which are the synchronization and parallelization of a large number of computation units. There are three types of synchronization, namely processing core synchronization, processing cluster synchronization, and/or chip synchronization. Processing core synchronization guarantees the correctness of the parallel execution of the pipelines of different functional units (for example scalar, vector, and matrix units). A processing cluster includes multiple processing cores, and processing cluster synchronization keeps all processing cores within a given cluster synchronized. Chip synchronization ensures that all clusters continue execution only after every cluster has reached the synchronization point. It should be noted that processing core synchronization can be hidden from the user to simplify the programmer's burden. One potential optimization is software pipelining, which can be used to hide the latency of memory accesses.
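As a rough illustration of the loop restructuring used in software pipelining (double buffering), the sketch below loads the next tile inside the iteration that computes the current one; on real hardware the load and the computation would be issued to different units so that they overlap in time, whereas this scalar sketch only shows the ordering. The tile representation and the trivial compute step are assumptions made for the example.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Software-pipelined (double-buffered) tile loop: the load of tile t+1 is issued in the
// same iteration that computes tile t, so memory latency can be hidden behind compute.
void pipelined_tiles(const std::vector<Buf>& tiles, Buf& out) {
    if (tiles.empty()) return;
    Buf current = tiles[0];                              // prologue: load the first tile
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        Buf next;
        if (t + 1 < tiles.size()) next = tiles[t + 1];   // load of tile t+1 ...
        for (float v : current) out.push_back(v * 2.0f); // ... alongside compute on tile t
        current = next;                                  // the prefetched tile becomes current
    }
}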
The mapping of parallel tasks to parallel processing clusters in the virtual processor is described below with reference to FIG. 13a and FIG. 13b.
As shown in FIG. 13a, suppose that the kernel on the host has a task with TaskDim.x = 2 and TaskDim.y = 4, which means that 2 processing clusters are required and each processing cluster requires 4 processing cores. Such a task can be mapped to 2 processing clusters in the TAM, namely processing cluster 0 and processing cluster 1, each having 4 processor cores, namely processing core 0, processing core 1, processing core 2, and processing core 3.
As described above, the TAM abstracts the common features of multiple neural network accelerators; it extracts the key and common features of tensor processing in various ML architectures. Therefore, when a task is mapped onto the TAM, the TAM can be further mapped onto a specific hardware accelerator.
Thus, the above parallel tasks can be mapped to specific hardware accelerators according to the underlying hardware structure.
As shown in FIG. 13b, GPU-TC and Cambricon-CC are taken as examples. In GPU-TC, two SMs can be used. Although each SM may include, for example, 8 tensor processing cores, in this task each streaming multiprocessor SM only needs to use 4 tensor processing cores.
In Cambricon-CC, two processing clusters can be used, namely processing cluster 0 and processing cluster 1, each with 4 processing cores; therefore, in this task, each processing cluster can use 4 processing cores.
When a task requires more processing clusters or processing cores than are available, it can be run in a time-sharing manner, that is, the task can be executed in multiple passes.
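The mapping and time-sharing decision can be sketched as a simple calculation; the TaskDim/Hardware descriptors and the function below are invented for illustration and do not correspond to any real runtime interface.
#include <cstdio>

// Illustrative descriptors for mapping TaskDim.{x, y} onto processing clusters and cores.
struct TaskDim { int x; int y; };             // x: clusters requested, y: cores per cluster
struct Hardware { int clusters; int cores_per_cluster; };

// Number of time-shared passes needed when the request exceeds the available hardware.
int passes_needed(TaskDim t, Hardware hw) {
    int cluster_passes = (t.x + hw.clusters - 1) / hw.clusters;
    int core_passes = (t.y + hw.cores_per_cluster - 1) / hw.cores_per_cluster;
    return cluster_passes * core_passes;
}

int main() {
    TaskDim task{2, 4};                       // TaskDim.x = 2, TaskDim.y = 4 (FIG. 13a)
    Hardware cambricon_cc{2, 4};              // two processing clusters of four cores each
    std::printf("passes: %d\n", passes_needed(task, cambricon_cc));   // prints: passes: 1
    return 0;
}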
The above control logic can also be implemented in the TAL, so that the user can customize it according to actual needs.
After the above conversion of the intermediate representation, optimization, storage allocation, and abstraction of the accelerator hardware, the generated code can be converted into a target program adapted to the specific hardware.
For example, when the underlying hardware is an MLU, the target program can be in the Bang C language to adapt to the MLU hardware; when the underlying hardware is GPU-TC, the target program can be in the CUDA C language. It can be understood that when the underlying hardware is another accelerator, the target program can be converted into machine instructions adapted to that accelerator.
In the above technical solution of the present disclosure, since the TAM can be instantiated for different specific platforms (for example MLU, TPU, GPU-TC), it is applicable to various accelerators, which significantly improves the portability of the system and facilitates porting onto the hardware platforms of various accelerators.
To verify the technical solution of TIC, three types of machine learning computers were used, namely a GPU with tensor processing cores (GPU-TC), an MLU, and a TPU. The neural network algorithms used for evaluation come from different application scenarios and include ResNet-50 and VGG16; ResNet and VGG are used not only for image classification but also as backbone networks for general feature extraction.
This application is compared against three baselines. The first baseline is the TVM stack, which directly supports GPU-TC, MLU, and TPU by rewriting tensor primitives. The second baseline is Glow IR, which includes two layers of IR, a high-level graph IR (mainly used for graph optimization) and a low-level instruction IR (mainly used for memory-related optimization). The third baseline is the TensorFlow framework. The above baselines were originally designed mainly for CPUs and GPUs (without tensor processing cores) and were modified according to the technical solution of this application to support GPU-TC, MLU, and TPU.
The experimental and comparative results mainly cover three aspects: performance, efficiency, and portability. They are described in detail below.
1. Performance
GPU-TC: FIG. 14 shows the performance of the TIC technique and the other baseline techniques on GPU-TC, where the execution latency is normalized to the latency of TensorFlow. Compared with the TensorFlow programming framework, the average performance improvement of TIC is approximately 201%. The main reason is that unnecessary framework overhead is avoided and multiple optimizations are performed at the TAL. Compared with Glow, the average performance improvement is approximately 34.8%, because Glow's backend is implemented directly in CUDA C rather than through highly optimized libraries. Compared with TVM, the average performance improvement is approximately 13.7%. The benefit comes mainly from two optimization passes, namely the data dimension order optimization performed with the help of TIR and the operator fusion optimization.
MLU: FIG. 15 shows the performance of the TIC technique and the other baseline techniques on the MLU. Compared with the TensorFlow programming framework, the average performance of TIC is approximately 96.9% of the performance of TensorFlow. The main reason is that TensorFlow for the MLU runs on highly optimized libraries, while further optimizations can still be applied to the TIC solution. For example, on ResNet50, the performance of the disclosed solution is approximately 41.4% higher than that of TensorFlow, because several customized optimizations are performed on top of TIC. Compared with Glow and TVM, the performance improvements are approximately 23.5% and 20.7% respectively, which demonstrates the efficiency of TIC as a compilation architecture.
TPU-Lite: FIG. 16 shows the performance of the TIC technique and the other baseline techniques on the TPU; the original Glow cannot run on TPU-Lite. Since the TPU primitives considered are relatively coarse-grained, only very limited optimizations can be performed on the TIC solution. Therefore, the performance of the different implementations is relatively close.
2. Efficiency
In the TIC solution, efficiency can be evaluated from different perspectives. From the perspective of using the programming framework to build ML applications, the efficiency of the TIC solution is the same as that of the other baselines, since the programming interface is preserved. From the perspective of using TIR and TAL to build new operations, the efficiency is significantly improved, because the semantics of tensor data are preserved all the way from the graph nodes down to the hardware of the TAM.
FIG. 17 compares the reduction in LoC when the TVM and TIC techniques are used to implement convolution operations on GPU-TC and the MLU. It can be clearly seen that the LoC drops by 43% and 38% on GPU-TC and the MLU respectively. From the perspective of using the TAL to directly build ML applications, developing applications in standard C/C++ already exhibits high efficiency. One clear advantage is that many existing applications written in C/C++ can be converted to TAL directly, without tensor-related optimization.
3. Portability
Portability is evaluated using a quantitative metric; this approach is described in Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter, "NVIDIA tensor core programmability, performance & precision", IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531, IEEE.
Table 1 below compares the portability achieved using TensorFlow, TVM, and the TIC of the present disclosure. Quantitatively, the TIC of the present disclosure improves on TensorFlow and TVM by 25% and 15.4%, respectively.
Architecture    Portability
TensorFlow      0.6219%
TVM             0.6743%
TIC             0.7784%
Table 1
In the present disclosure, a Tensor Intact Compiling (TIC) architecture is proposed for improving performance, efficiency, and portability. The idea behind TIC is to preserve tensor semantics throughout the compilation process, that is, from the upper-level programming interface, through the lower-level intermediate representations and the various languages, all the way down to the tensor-related instructions of the underlying hardware platform. The whole TIC architecture may preferably include three components, namely the tensor abstract machine module TAM, the tensor-aware language module TAL, and the tensor intermediate representation module TIR, which are mainly used to address portability, performance, and efficiency, respectively. Programmers can act on TIC through the programming framework or by using the TAL directly; the code is lowered into an optimized form on the TAM, and the TAL can further be compiled into binaries for different target platforms. Experimental results show that, on GPU-TC, TPU, and MLU, TIC outperforms the prior art in performance, portability, and efficiency.
Embodiments of the present disclosure also provide an electronic device, including: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
According to another embodiment of the present disclosure, there is also provided a computer-readable storage medium including computer-executable instructions which, when run by one or more processors, perform the method described above.
The above method and device can also be implemented as a compiling apparatus, and the compiling apparatus can form part of a combined processing apparatus.
FIG. 18 shows a combined processing apparatus 1800, which includes the above-mentioned compiling apparatus 1802, a general interconnection interface 1804, and another processing apparatus 1806. The compiling apparatus according to the present disclosure interacts with the other processing apparatus to jointly complete the operation specified by the user. FIG. 18 is a schematic diagram of the combined processing apparatus.
The compiling apparatus can be implemented in various ways, such as software or hardware, and can run on any one or more of general-purpose/special-purpose processors such as a CPU, a graphics processor GPU, or a neural network processor.
The other processing apparatus includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit CPU, a graphics processing unit GPU, and a neural network processor. The number of processors included in the other processing apparatus is not limited. The other processing apparatus serves as the interface between the machine learning computing apparatus and external data and control, including data transfer, and completes basic control such as starting and stopping of the machine learning computing apparatus; the other processing apparatus can also cooperate with the machine learning computing apparatus to complete computing tasks together.
The general interconnection interface is used to transfer data and control instructions between the compiling apparatus (including, for example, a machine learning computing apparatus) and the other processing apparatus. The compiling apparatus obtains the required input data from the other processing apparatus and writes it into the on-chip storage of the compiling apparatus; it can obtain control instructions from the other processing apparatus and write them into the on-chip control cache of the compiling apparatus; it can also read the data in the storage module of the compiling apparatus and transfer it to the other processing apparatus.
Optionally, the structure may further include a storage apparatus 1808, which is connected to the compiling apparatus and the other processing apparatus respectively. The storage apparatus is used to store data of the compiling apparatus and the other processing apparatus, and is particularly suitable for data that cannot be held entirely in the internal storage of the compiling apparatus or the other processing apparatus.
The combined processing apparatus can be used as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the general interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
In some embodiments, the present disclosure also discloses a chip, which includes the above compiling apparatus or combined processing apparatus.
In some embodiments, the present disclosure also discloses a board card, which includes the above chip. Referring to FIG. 19, an exemplary board card is provided; in addition to the above chip 1902, the board card may also include other supporting components, including but not limited to: a storage device 1904, an interface apparatus 1906, and a control device 1908.
The storage device is connected to the chip in the chip package structure through a bus and is used to store data. The storage device may include multiple groups of storage units 1910. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units. Each group of storage units may include multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface apparatus is electrically connected to the chip in the chip package structure. The interface apparatus is used to implement data transmission between the chip and an external device 1912 (for example, a server or a computer). For example, in one embodiment, the interface apparatus may be a standard PCIE interface; the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement the data transfer. In another embodiment, the interface apparatus may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the computation result of the chip is transmitted back to the external device (for example, the server) by the interface apparatus.
The control device is electrically connected to the chip. The control device is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, it can drive multiple loads; therefore, the chip can be in different working states such as multiple-load and light-load. Through the control device, the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip can be regulated.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described sequence of actions, because according to the present disclosure some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, when the technical solution of the present disclosure is embodied in the form of a software product, the computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present disclosure have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are only used to help understand the method of the present disclosure and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.
The technical solutions of the present disclosure may be better understood through the following clauses:
条款1.一种对多维数据进行处理的方法,包括: Clause 1. A method of processing multidimensional data, comprising:
接收第一中间表示,所述第一中间表示具有多维数据语义;receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics;
解析所述第一中间表示,并将所述第一中间表示转换为目标程序,其中,所述目标程序保留了多维数据语义。The first intermediate representation is parsed, and the first intermediate representation is converted into a target program, wherein the target program preserves multidimensional data semantics.
条款2.根据条款1所述的方法,其中,所述第一中间表示包括所述多维数据的操作信息、数据属性及维度顺序中的一种或多种。 Clause 2. The method of Clause 1, wherein the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
Clause 3. The method according to Clause 1 or 2, further comprising:
receiving an abstract language representation and compiling the abstract language representation into the first intermediate representation, wherein the abstract language representation is editable by a user.
Clause 4. The method according to Clause 3, wherein a second intermediate representation is received to form the abstract language representation, the second intermediate representation comprising an intermediate representation expressed as a graph.
Clause 5. The method according to Clause 1 or 2, wherein converting the first intermediate representation into the target program comprises:
converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and
converting the abstract language representation into the target program.
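Continuing the hypothetical TensorOp sketch above, the two-stage conversion of Clause 5 might be pictured as follows; the function names and the intrinsic-style output format are assumptions made purely for illustration, not the disclosed implementation.

```python
def lower_to_abstract_language(ops):
    """Emit user-editable abstract-language statements that still speak in whole tensors."""
    lines = []
    for op in ops:
        args = ", ".join(f"{t.dtype}{list(t.shape)}" for t in op.inputs)
        lines.append(f"{op.name}({args}) -> {op.output.dtype}{list(op.output.shape)}")
    return "\n".join(lines)

def lower_to_target_program(abstract_source, backend="accelerator"):
    """Translate abstract-language source into target code while keeping tensor primitives."""
    emitted = []
    for stmt in abstract_source.splitlines():
        # Each tensor statement maps onto one backend intrinsic instead of scalar loops.
        emitted.append(f"__{backend}_intrinsic__ {stmt};")
    return "\n".join(emitted)

target_program = lower_to_target_program(lower_to_abstract_language([conv]))
```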
Clause 6. The method according to any one of Clauses 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation into the first intermediate representation,
wherein the second intermediate representation comprises an intermediate representation expressed as a graph.
Clause 7. The method according to Clause 4 or 6, further comprising:
parsing a neural network model file, the neural network model file including operation nodes of the neural network and their topological connection relationships; and
obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
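A minimal, illustrative sketch of deriving a graph-form second intermediate representation from a model file is given below; the JSON layout and field names are assumed for the example and are not prescribed by the disclosure.

```python
import json

def parse_model_file(path):
    """Collect operation nodes and their topological connections into a simple graph IR."""
    with open(path) as f:
        model = json.load(f)                     # assumed layout: {"nodes": [{...}, ...]}
    graph_ir = {"nodes": {}, "edges": []}
    for node in model["nodes"]:
        graph_ir["nodes"][node["name"]] = {"op": node["op"], "attrs": node.get("attrs", {})}
        for src in node.get("inputs", []):       # topological connection relationships
            graph_ir["edges"].append((src, node["name"]))
    return graph_ir
```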
Clause 8. The method according to any one of Clauses 1-7, further comprising: optimizing the first intermediate representation, so that the target program is generated from the optimized first intermediate representation.
Clause 9. The method according to Clause 8, wherein optimizing the first intermediate representation comprises:
converting a first dimension order of the multidimensional data into a second dimension order to adapt to a corresponding neural network accelerator; and/or fusing a first operator and a second operator.
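Again continuing the hypothetical TensorOp sketch, the two optimizations of Clause 9, dimension-order conversion and operator fusion, could be pictured as the following passes; this is a sketch under assumed data structures, not the disclosed implementation.

```python
def convert_dim_order(t, new_order):
    """Permute a TensorType from its current dimension order (e.g. NCHW) to new_order (e.g. NHWC)."""
    perm = [t.dim_order.index(axis) for axis in new_order]
    return TensorType(t.dtype, tuple(t.shape[i] for i in perm), new_order)

def fuse_operators(op_a, op_b):
    """Fuse two adjacent operators (e.g. conv2d followed by relu) into a single IR node,
    so the intermediate tensor never round-trips through off-chip memory."""
    return TensorOp(name=f"{op_a.name}_{op_b.name}",
                    inputs=op_a.inputs,
                    output=op_b.output,
                    attrs={**op_a.attrs, **op_b.attrs})
```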
Clause 10. The method according to any one of Clauses 1-9, wherein the abstract language representation is formed based on a virtual processor, the virtual processor comprising:
an I/O interface circuit, a control circuit, and an operation component, the operation component including a first storage circuit and an operation circuit;
the I/O interface circuit being configured for input and output of the virtual processor;
the control circuit being configured to perform access operations through the I/O interface circuit;
the first storage circuit being configured to read at least input data and weight data through the I/O interface circuit; and
the operation circuit being configured to read the input data and the weight data from the first storage circuit to perform operations.
Clause 11. The method according to Clause 10, wherein:
the first storage circuit includes a parallel weight memory (PWM) for storing weight data and a parallel neuron memory (PNM) for storing input data;
the operation circuit includes a parallel functional unit (PFU) for operating on non-scalar data; and
the I/O interface circuit is connected to the control circuit, the PWM, and the PNM; the control circuit is connected to the PWM, the PNM, and the PFU; and both the PWM and the PNM are connected to the PFU.
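For intuition only, a toy Python model of the virtual processor of Clauses 10 and 11 follows; the component and method names are assumptions, and the model ignores timing, parallel lanes, the control circuit's instruction stream, and the SFU.

```python
class VirtualProcessor:
    """Toy model: the PWM holds weights, the PNM holds input neurons, the PFU consumes both."""
    def __init__(self, pnm_words, pwm_words):
        self.pnm = [0.0] * pnm_words    # parallel neuron memory (input data)
        self.pwm = [0.0] * pwm_words    # parallel weight memory (weight data)

    def io_load(self, neurons, weights):
        """The control circuit drives the I/O interface to fill the on-chip memories."""
        self.pnm[:len(neurons)] = neurons
        self.pwm[:len(weights)] = weights

    def pfu_dot(self, n):
        """Parallel functional unit: one non-scalar operation over the first n elements."""
        return sum(x * w for x, w in zip(self.pnm[:n], self.pwm[:n]))

vp = VirtualProcessor(pnm_words=1024, pwm_words=1024)
vp.io_load(neurons=[1.0, 2.0, 3.0], weights=[0.5, 0.5, 0.5])
partial_sum = vp.pfu_dot(3)   # 1*0.5 + 2*0.5 + 3*0.5 = 3.0
```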
Clause 12. The method according to Clause 10 or 11, wherein the virtual processor further includes a scalar functional unit (SFU) connected to the control circuit and the I/O interface circuit and configured to operate on scalar data.
Clause 13. The method according to any one of Clauses 10-12, wherein the virtual processor further includes a shared memory circuit (PSM) and the number of operation components is plural;
the PSM being configured to read input data and weight data through the I/O interface circuit; and
the plurality of operation components being connected to the PSM in parallel and configured to read the input data and the weight data from the shared memory circuit to perform operations.
Clause 14. The method according to Clause 6, wherein converting the second intermediate representation into the first intermediate representation comprises:
splitting the multidimensional data of the second intermediate representation into a plurality of pieces of sub-multidimensional data according to the size of a storage basic block (BBM), and loading the pieces of sub-multidimensional data from an off-chip memory into the shared memory circuit (PSM) over multiple passes;
splitting the sub-multidimensional data in the PSM according to a compute basic block (BBC), loading the input data of the split sub-multidimensional data into the parallel neuron memory (PNM), and loading the weight data into the parallel weight memory (PWM);
obtaining an intermediate result after operating on the input data and the weight data, and storing the intermediate result in the PSM; and
storing the intermediate result in the off-chip memory.
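As an illustrative sketch of the two-level splitting in Clause 14, the loop below mimics only the data movement; the block sizes, helper names, and the use of a flat one-dimensional list are assumptions made for the example.

```python
def tiled_execute(data, bbm, bbc, compute):
    """Split data by the storage basic block (BBM) for the PSM level,
    then by the compute basic block (BBC) for the PNM/PWM level."""
    results = []
    for psm_start in range(0, len(data), bbm):           # off-chip memory -> PSM, one BBM tile per pass
        psm_tile = data[psm_start:psm_start + bbm]
        partials = []
        for pnm_start in range(0, len(psm_tile), bbc):   # PSM -> PNM/PWM, one BBC tile per pass
            pnm_tile = psm_tile[pnm_start:pnm_start + bbc]
            partials.append(compute(pnm_tile))           # operation at the PFU level
        results.append(sum(partials))                    # intermediate result held in the PSM
    return results                                       # finally written back to off-chip memory

out = tiled_execute(list(range(64)), bbm=16, bbc=4, compute=sum)   # [120, 376, 632, 888]
```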
Clause 15. An electronic device, comprising:
one or more processors; and
a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method according to any one of Clauses 1-14.
Clause 16. A computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method according to any one of Clauses 1-14.

Claims (16)

  1. A method of processing multidimensional data, comprising:
    receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and
    parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves the multidimensional data semantics.
  2. The method according to claim 1, wherein the first intermediate representation includes one or more of operation information, data attributes, and a dimension order of the multidimensional data.
  3. The method according to claim 1 or 2, further comprising:
    receiving an abstract language representation and compiling the abstract language representation into the first intermediate representation, wherein the abstract language representation is editable by a user.
  4. The method according to claim 3, wherein a second intermediate representation is received to form the abstract language representation, the second intermediate representation comprising an intermediate representation expressed as a graph.
  5. The method according to claim 1 or 2, wherein converting the first intermediate representation into the target program comprises:
    converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and
    converting the abstract language representation into the target program.
  6. The method according to any one of claims 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation into the first intermediate representation,
    wherein the second intermediate representation comprises an intermediate representation expressed as a graph.
  7. The method according to claim 4 or 6, further comprising:
    parsing a neural network model file, the neural network model file including operation nodes of the neural network and their topological connection relationships; and
    obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
  8. The method according to any one of claims 1-7, further comprising: optimizing the first intermediate representation, so that the target program is generated from the optimized first intermediate representation.
  9. The method according to claim 8, wherein optimizing the first intermediate representation comprises:
    converting a first dimension order of the multidimensional data into a second dimension order to adapt to a corresponding neural network accelerator; and/or fusing a first operator and a second operator.
  10. The method according to any one of claims 1-9, wherein the abstract language representation is formed based on a virtual processor, the virtual processor comprising:
    an I/O interface circuit, a control circuit, and an operation component, the operation component including a first storage circuit and an operation circuit;
    the I/O interface circuit being configured for input and output of the virtual processor;
    the control circuit being configured to perform access operations through the I/O interface circuit;
    the first storage circuit being configured to read at least input data and weight data through the I/O interface circuit; and
    the operation circuit being configured to read the input data and the weight data from the first storage circuit to perform operations.
  11. The method according to claim 10, wherein:
    the first storage circuit includes a parallel weight memory (PWM) for storing weight data and a parallel neuron memory (PNM) for storing input data;
    the operation circuit includes a parallel functional unit (PFU) for operating on non-scalar data; and
    the I/O interface circuit is connected to the control circuit, the PWM, and the PNM; the control circuit is connected to the PWM, the PNM, and the PFU; and both the PWM and the PNM are connected to the PFU.
  12. The method according to claim 10 or 11, wherein the virtual processor further includes a scalar functional unit (SFU) connected to the control circuit and the I/O interface circuit and configured to operate on scalar data.
  13. The method according to any one of claims 10-12, wherein the virtual processor further includes a shared memory circuit (PSM) and the number of operation components is plural;
    the PSM being configured to read input data and weight data through the I/O interface circuit; and
    the plurality of operation components being connected to the PSM in parallel and configured to read the input data and the weight data from the shared memory circuit to perform operations.
  14. The method according to claim 6, wherein converting the second intermediate representation into the first intermediate representation comprises:
    splitting the multidimensional data of the second intermediate representation into a plurality of pieces of sub-multidimensional data according to the size of a storage basic block (BBM), and loading the pieces of sub-multidimensional data from an off-chip memory into the shared memory circuit (PSM) over multiple passes;
    splitting the sub-multidimensional data in the PSM according to a compute basic block (BBC), loading the input data of the split sub-multidimensional data into the parallel neuron memory (PNM), and loading the weight data into the parallel weight memory (PWM);
    obtaining an intermediate result after operating on the input data and the weight data, and storing the intermediate result in the PSM; and
    storing the intermediate result in the off-chip memory.
  15. An electronic device, comprising:
    one or more processors; and
    a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method according to any one of claims 1-14.
  16. A computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method according to any one of claims 1-14.
PCT/CN2021/123569 2020-10-16 2021-10-13 Device and method for processing multi-dimensional data, and computer program product WO2022078400A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011112090.5A CN114385867A (en) 2020-10-16 2020-10-16 Apparatus, method and computer program product for processing multidimensional data
CN202011112090.5 2020-10-16

Publications (1)

Publication Number Publication Date
WO2022078400A1 true WO2022078400A1 (en) 2022-04-21

Family

ID=81193962

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123569 WO2022078400A1 (en) 2020-10-16 2021-10-13 Device and method for processing multi-dimensional data, and computer program product

Country Status (2)

Country Link
CN (1) CN114385867A (en)
WO (1) WO2022078400A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132150A (en) * 2024-05-07 2024-06-04 中科寒武纪科技股份有限公司 Data access mode deducing method of calculation graph and related product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024065860A1 (en) * 2022-10-01 2024-04-04 Intel Corporation Hardware support for n-dimensional matrix load and store instructions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392296A1 (en) * 2019-06-28 2019-12-26 John Brady Hardware agnostic deep neural network compiler
CN110929850A (en) * 2019-11-26 2020-03-27 国家超级计算无锡中心 Deep learning operator automatic optimization system and method based on Shenwei processor
CN111104120A (en) * 2018-10-29 2020-05-05 赛灵思公司 Neural network compiling method and system and corresponding heterogeneous computing platform
WO2020093304A1 (en) * 2018-11-08 2020-05-14 北京比特大陆科技有限公司 Method, apparatus, and device for compiling neural network, storage medium, and program product

Also Published As

Publication number Publication date
CN114385867A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
Kwon et al. Beyond the memory wall: A case for memory-centric hpc system for deep learning
Gu et al. Biscuit: A framework for near-data processing of big data workloads
CN110704360B (en) Graph calculation optimization method based on heterogeneous FPGA data flow
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
Grossman et al. Hadoopcl: Mapreduce on distributed heterogeneous platforms through seamless integration of hadoop and opencl
WO2022078400A1 (en) Device and method for processing multi-dimensional data, and computer program product
CN112465108A (en) Neural network compiling method for storage and calculation integrated platform
US20230124520A1 (en) Task execution method and storage device
CN109918199B (en) GPU-based distributed graph processing system
WO2020083050A1 (en) Data stream processing method and related device
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
WO2021000971A1 (en) Method and device for generating operation data and related product
US20220188614A1 (en) Fractal calculating device and method, integrated circuit and board card
CN115033188B (en) Storage hardware acceleration module system based on ZNS solid state disk
TWI754310B (en) System and circuit of pure functional neural network accelerator
WO2022253075A1 (en) Compilation method and related apparatus
CN111831582A (en) Memory management device and method for intelligent processor and electronic equipment
Liu et al. Accelerating large-scale DEVS-based simulation on the cell processor
CN115840894A (en) Method for processing multidimensional tensor data and related product thereof
CN111831333A (en) Instruction decomposition method and device for intelligent processor and electronic equipment
Guo et al. Fused DSConv: Optimizing sparse CNN inference for execution on edge devices
CN112214443A (en) Secondary unloading device and method arranged in graphic processor
Kabrick et al. CODIR: towards an MLIR codelet model dialect
Shafiq et al. Automated flow for compressing convolution neural networks for efficient edge-computation with FPGA
Du et al. Breaking the interaction wall: A DLPU-centric deep learning computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879443

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879443

Country of ref document: EP

Kind code of ref document: A1