WO2022078400A1 - Device and method for processing multi-dimensional data, and computer program product


Info

Publication number
WO2022078400A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
intermediate representation
representation
memory
pnm
Prior art date
Application number
PCT/CN2021/123569
Other languages
French (fr)
Chinese (zh)
Inventor
董守杨
文渊博
杨君
马晓东
苏振宇
陈峋宇
Original Assignee
Cambricon Technologies Corporation Limited (中科寒武纪科技股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corporation Limited (中科寒武纪科技股份有限公司)
Publication of WO2022078400A1 publication Critical patent/WO2022078400A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the field of computers, and more particularly, to the field of processing multidimensional data.
  • the purpose of the present disclosure is to solve the prior-art problem that tensor data needs to be split into scalar data, and to provide a method and device capable of retaining tensor primitives throughout the entire processing process.
  • a method for processing multidimensional data comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and parsing the first intermediate representation and converting it into a target program, where the target program preserves multidimensional data semantics.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to execute the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the proposed Tensor Intact Compiling (TIC) architecture can improve performance, efficiency, and portability.
  • FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure
  • Figure 2a shows a flowchart of a method according to an embodiment of the present disclosure
  • Figure 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in Figure 2a;
  • Figure 3 shows a schematic diagram of a GIR according to one embodiment of the present disclosure
  • Fig. 4a shows a device for processing multi-dimensional data according to an embodiment of the present disclosure
  • Fig. 4b shows a flowchart of the steps performed by the first processing device according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to another embodiment of the present disclosure
  • FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors
  • FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., thereby forming a multi-dimensional data processor with common features;
  • Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure
  • Figure 7c shows the structure of the multi-layer processor and its functions
  • Figure 8a shows an exemplary code diagram of the first intermediate representation TIR
  • Figure 8b shows a schematic diagram of converting the second intermediate representation to the first intermediate representation according to an embodiment of the present disclosure
  • Fig. 9a and Fig. 9b show schematic diagrams of a traditional neural network operation and of the neural network after operator fusion
  • FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure
  • FIG. 11 shows a storage manner of data when there are multiple PSMs according to one embodiment of the present disclosure
  • Figures 12a-12d show schematic diagrams of data rotation when multiple PSMs are performing data storage
  • Figures 13a and 13b depict schematic diagrams of mapping parallel tasks to parallel processing clusters in virtual processors
  • Figure 14 shows the performance of TIC's technology and other benchmark technologies on GPU-TC
  • Figure 15 shows the performance of TIC's technique and other benchmark techniques on MLU
  • Figure 16 shows the performance of TIC's technology and other benchmark technologies on TPU
  • Figure 17 compares the reduction in LoC when using the techniques of TVM and TIC to implement convolution operations on GPU-TC and MLU;
  • Figure 18 shows a combined processing device
  • Figure 19 shows an exemplary board.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure.
  • the methods of the embodiments of the present application can be applied to a processor, and a compiler, a compiling component, or a compiling program can run on the processor.
  • a compiler, compiling component, or compiling program may be used to perform at least one step in the method.
  • the method may include: in operation S110, receiving a first intermediate representation, the first intermediate representation having multi-dimensional data semantics; and, in operation S130, parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
  • the multi-dimensional data of the present disclosure may include non-scalar data such as vector data, matrix data, and tensor data, and may also be any other higher-dimensional data.
  • the technical solutions of the present disclosure can also process scalar data, which will be described later. However, it should be understood that the following description will mainly take tensor data as an example.
  • the semantics of multi-dimensional data are always maintained from receiving to parsing, rather than the multi-dimensional data being split into multiple scalar data processed in loops as in the traditional technology. Therefore, in the solution of the present disclosure, there is no need to first split multi-dimensional data into scalar data and then combine the scalar data back into multi-dimensional data, thereby reducing intermediate conversion processes and improving computing efficiency.
  • since the multi-dimensional data is always maintained, it is more intuitive for the user, and it is also convenient for the user to edit the data and the above-mentioned structures. Furthermore, eliminating the semantic gap further improves the efficiency of programming and operations.
  • the semantics of preserving multi-dimensional data can be implemented by corresponding programming languages, such as programming languages with Conv semantics and Tensor (tensor) types.
  • programming languages with Conv semantics and Tensor (tensor) types can greatly reduce the amount of programming and preserve the semantics of multi-dimensional data (such as tensor data).
  • the target program described above may include high-level languages supported by neural network hardware, such as CUDA C and BANG C described above. These target programs can also handle multidimensional data and retain multidimensional data semantics.
  • the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
  • operation information may refer to various information for operating on data, such as the structure of the neural network, operators in the neural network, data access operations, data optimization, etc., which describe all operations related to processing and computing the data.
  • the operation information of multidimensional data here still includes the semantics of multidimensional data.
  • the structure of the neural network can refer to the relationship between each operator and other operators in the neural network, the relationship between the input and output of the operators, etc., which describes the overall structure of the neural network.
  • An operator in a neural network can be any information describing an operator, such as the type of the operator, whether the operator is a single operator or a combination operator of multiple single operators, etc.
  • Data attributes can describe the type of data, such as float type, fix type, and so on. It should be understood that the above-mentioned types are merely examples and not limitations of the present disclosure.
  • the order of dimensions can be NHWC or NCHW.
  • N represents the number of images in the batch;
  • H represents the number of pixels of an image in the vertical direction;
  • W represents the number of pixels in the horizontal direction;
  • C represents the number of channels (such as the channels of a black-and-white image).
  • NHWC has better memory access locality (one output pixel can be obtained for every three input pixels), while NCHW must wait for all channel inputs to be ready to obtain the final output result, which requires a large temporary space.
  • a suitable format can be determined according to actual needs, such as the processing capability of the accelerator and the compatible dimensional order.
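  • as an illustration only (not code from the patent), the following minimal C sketch contrasts how one element (n, c, h, w) maps to a flat memory offset under the NHWC and NCHW dimension orders; the dimension sizes used are hypothetical:

```c
/* A minimal sketch (illustrative assumptions, not the patent's code):
 * flat offsets of element (n, c, h, w) under NHWC and NCHW layouts. */
#include <stdio.h>

/* offset when data is laid out as N x H x W x C */
static size_t offset_nhwc(size_t n, size_t c, size_t h, size_t w,
                          size_t H, size_t W, size_t C) {
    return ((n * H + h) * W + w) * C + c;
}

/* offset when data is laid out as N x C x H x W */
static size_t offset_nchw(size_t n, size_t c, size_t h, size_t w,
                          size_t C, size_t H, size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

int main(void) {
    size_t C = 3, H = 224, W = 224;   /* hypothetical image shape */
    /* adjacent channels of one pixel are contiguous in NHWC ... */
    printf("NHWC c=0/c=1: %zu %zu\n",
           offset_nhwc(0, 0, 5, 7, H, W, C), offset_nhwc(0, 1, 5, 7, H, W, C));
    /* ... but lie H*W elements apart in NCHW */
    printf("NCHW c=0/c=1: %zu %zu\n",
           offset_nchw(0, 0, 5, 7, C, H, W), offset_nchw(0, 1, 5, 7, C, H, W));
    return 0;
}
```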
  • Figure 2a shows a flowchart of a method according to an embodiment of the present disclosure.
  • converting the first intermediate representation into a target program S130 includes: in operation S1310, converting the first intermediate representation into an abstract language representation, wherein the abstract language representation includes multi-dimensional data semantics; and, in operation S1330, converting the abstract language representation into the target program.
  • FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in Figure 2a.
  • the architecture of the present disclosure is called Tensor Intact Compiling (TIC).
  • the architecture may include: machine learning applications A0, A1, ..., AM-1; a framework; a graph intermediate representation GIR; a tensor intermediate representation TIR (Tensor Intermediate Representation, the first intermediate representation as described above); a tensor-aware language module TAL (Tensor Aware Language, as described above); a tensor abstract machine module TAM (Tensor Abstract Machine); back-end high-level languages, such as CUDA C, BANG C, XLA-TPU, etc., wherein the above-mentioned target program may refer to these back-end high-level languages; and machine learning hardware H0, H1, ..., HM-1.
  • TIR is an intermediate representation that is designed to meet the needs of machine learning and can represent scalar, vector, matrix, and tensor operations. Therefore, in addition to regular scalar operations (such as arithmetic operations, logical operations, comparison operations, memory operations, function calls, and conditional operations, etc.), TIR can also provide descriptions of vectors, matrices, and tensors, thereby maintaining the semantics of these data.
  • the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graph-form intermediate representation.
  • the traditional intermediate representation can include a graphical intermediate representation (Graph Intermediate Representation, GIR).
  • the graph intermediate representation can be a computational graph intermediate representation obtained after parsing by the deep learning framework TensorFlow, or the computational graph intermediate representation of the neural network compilation framework TVM, such as NNVM or Relay; this is only used for illustration here and is not intended to limit the scope of this application.
  • the original IR is usually split into multiple intermediate representations of scalar computation loops, and then the target program is generated according to the intermediate representations of the scalar computation loops.
  • the IR in the form of tensor needs to be converted into the form of scalar first, and then the scalar form is converted into the target program that supports tensor semantics.
  • This way of converting back and forth between tensors and scalars is very verbose and error prone.
  • in the TIR of the present disclosure, it is not necessary to convert tensor data into scalar data, and the semantics of tensors can be preserved; in addition, large tensor operations can, for example, be converted into small tensor operations rather than into scalar operations as in traditional IRs.
  • compared with splitting tensor operations into scalar operations, TIR is more intuitive, which can improve development efficiency for users.
  • the semantics of tensors are always maintained, and there is no need to perform conversion between tensors and scalars, thereby improving the efficiency of code compilation and conversion.
  • Figure 3 shows a schematic diagram of a GIR according to one embodiment of the present disclosure.
  • the second intermediate representation can be obtained by: parsing a neural network model file, where the neural network model file includes operation nodes and topological connection relationships of the neural network; according to the operation nodes and the topological connection relationship to obtain the second intermediate representation.
  • the neural network model file described here can be a Json file, which records the structure, operators and other information of the neural network, and details of the neural network can be obtained from the Json file.
  • x is the input data, which is subjected to a convolution operation with the weight data; the intermediate data generated after the convolution operation is added with the data y, and finally the calculation result is obtained, wherein y is the bias data (bias value) in the convolution operation.
  • one tensor data is divided into a plurality of scalar data during calculation, and is represented by, for example, a for loop.
  • tensor computation is divided into scalar computation, thus losing the semantics of tensor data. This brings a heavy programming burden to the user and is computationally inefficient.
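  • purely as an illustration (not code from the patent), the following C sketch shows the difference between lowering a tensor addition into a scalar loop, as a traditional lowering would, and keeping it as a single tensor-level primitive; tensor_add() is a hypothetical stand-in for a native tensor instruction or library call:

```c
#include <stddef.h>

/* traditional lowering: the tensor operation becomes scalar loops and the
 * multidimensional semantics are lost */
void add_scalar_lowered(const float *x, const float *y, float *out,
                        size_t n, size_t c, size_t h, size_t w) {
    for (size_t i = 0; i < n * c * h * w; ++i)
        out[i] = x[i] + y[i];
}

/* hypothetical tensor primitive: in a tensor-preserving flow the backend
 * would map this directly to native tensor instructions; it is stubbed with
 * the same loop here only so that the sketch is self-contained */
static void tensor_add(const float *x, const float *y, float *out,
                       const size_t shape[4]) {
    size_t total = shape[0] * shape[1] * shape[2] * shape[3];
    for (size_t i = 0; i < total; ++i) out[i] = x[i] + y[i];
}

/* tensor-preserving form: one operation, the shape travels with it */
void add_tensor_preserving(const float *x, const float *y, float *out,
                           size_t n, size_t c, size_t h, size_t w) {
    const size_t shape[4] = { n, c, h, w };
    tensor_add(x, y, out, shape);
}
```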
  • Figure 4b shows an example of a TIR representation according to one embodiment of the present disclosure.
  • the TIR indicates that the dimension order can be NCHW, and specifies the size of N ("batch_size"), the value of C ("output_channel"), and the height ("height") and width ("width").
  • the first intermediate representation also includes data ("data") and kernel ("kernel"), their convolution operation information, and the data type (e.g. float16).
  • the semantic scheme of tensor data included in FIG. 4b is not only applicable to tensor data, but also to vector data and matrix data, where vector data is one-dimensional data, and matrix data is two-dimensional data.
  • the programming manner for preserving the semantics of multi-dimensional data is not limited to the scheme shown in FIG. 4b, and any other manners capable of preserving the semantics of multi-dimensional data are included within the scope of the present disclosure.
  • TAL is an abstract language representation that can be built upon extensions to the C language, taking into account the characteristics of lower-level hardware (eg, on-chip memory hierarchy, control logic, and computational units, etc.). These hardware features will be described later when the TAM is introduced.
  • the goal of TAL is to provide users with alternatives that can be easily modified according to their needs.
  • TAL can be converted to code for target platforms; such target platforms are characterized by having native tensor instructions, such as wmma for GPU-TC.
  • the process of converting the TAL into the code of the target platform may include: firstly converting the TAL into a target program, and then compiling the target program into machine instructions that the target platform can run.
  • the TAM is a programming model provided to users of software programming that contains a basic abstraction for hardware accelerators.
  • the TAM in the embodiments of the present application may abstract common features of various neural network accelerators, and extract various key and common features of tensor processing in various machine learning architectures. Based on this TAM, hardware-aware optimizations can be performed at higher layers, and these features can even be exposed to the user. Since the TAM can be instantiated for different specific platforms (eg GPU-TC), etc., the portability of the system can be significantly improved.
  • TIR, TAL, and the TAM supporting TAL can all process multi-dimensional data and can recognize multi-dimensional semantics. Therefore, in the above conversion process, there is no need to convert multi-dimensional data such as tensor data into scalar data; instead, the conversion can work directly in the context of multidimensional data, thus reducing or eliminating the need to convert multidimensional data to scalar data and/or convert scalar data back to multidimensional data.
  • TAM is an abstraction of common features of various neural network accelerators, so it can also be regarded as a specific neural network accelerator. Just like a common neural network accelerator can run corresponding machine instructions, this particular neural network accelerator can also run a corresponding target program. It should be understood that the target program here is a general term, which may be a user-editable high-level language.
  • TAM may correspond to specific neural network accelerator hardware, and TAL may correspond to the above-mentioned target program.
  • the framework in Figure 2b can be, for example: Caffe (a convolutional neural network framework), TensorFlow, MXNet, PyTorch, PaddlePaddle (Baidu's deep learning framework), etc.
  • TAM can support the operation of TAL, just like the hardware neural network accelerator supports a specific language, such as GPU-TC supports CUDA C, MLU supports Bang C, and so on.
  • the machine learning hardware may exemplarily include H0, H1, ..., HM-1, and the languages supported by the machine learning hardware may include CUDA C, BANG C, XLA-TPU, etc.; the supported languages depend on the specific hardware architecture.
  • TAL being an abstract language means that it can easily run on the various architectures abstracted by the TAM.
  • TIC architecture of one embodiment of the present disclosure is described above in conjunction with Figures 2a-4b.
  • a TIC architecture according to another embodiment of the present disclosure is described below.
  • converting the first intermediate representation into a target program S130 may include: receiving an abstract language representation, and compiling the abstract language representation into the first intermediate representation; wherein the abstract language representation is editable by the user.
  • Figure 5 shows a schematic diagram of a tensor-preserving architecture according to the above method.
  • the architecture shown in Figure 5 may include: machine learning applications A0, A1, ..., AM-1; a framework; a graph intermediate representation GIR; a tensor-aware language module TAL; a tensor abstract machine module TAM; back-end high-level languages such as CUDA C, BANG C, XLA-TPU, etc.; and machine learning hardware H0, H1, ..., HM-1.
  • the abstract linguistic representation may be formed by receiving a second intermediate representation, the second intermediate representation comprising a graphically expressed intermediate representation.
  • the multidimensional data from the GIR can be received to form the TAL, and then the abstract language representation TAL is transformed into the first intermediate representation TIR; this differs from the flow shown in Figure 2b, where the GIR is first transformed into TIR and then into TAL.
  • the information can first be edited in the TAL to form the TIR used for subsequent processing. For example, operators derived from the GIR can be modified in the TAL.
  • the user can directly edit new operators in the TAL without receiving existing operators from the GIR. Users can form new operators that are not in the framework according to their own needs. This makes the architecture provided by the present disclosure more flexible and adaptable.
  • the TAL in FIG. 5 can also directly receive multi-dimensional data from the framework, such as operators in the framework, without receiving operators from the GIR.
  • the TAL in Figure 5 can also be built as extensions of the C language, which likewise take into account the characteristics of lower-level hardware (e.g., the on-chip memory hierarchy, control logic, and computing units).
  • the goal of TAL is to provide users with alternatives that can be modified according to the user's needs, such as defining new operators or modifying existing operators for the desired optimization.
  • the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graph-form intermediate representation.
  • a portion of the GIR may be converted to TAL, which is then edited by the user, and another portion may be converted to the TIR of the present disclosure.
  • the difference between GIR and TIR has been described above with reference to FIG. 4a and FIG. 4b, and will not be repeated here.
  • the graph intermediate representation GIR shown in Figure 5 can be entirely converted to TIR, and operators defined by the user in TAL can also be converted to TIR. It should be clear that the connection between the GIR and the TAL in FIG. 5 only schematically expresses a possible implementation, and this connection does not necessarily exist.
  • the machine learning hardware can be various hardware, such as GPU, MLU, etc.; each hardware has its own programming language, such as CUDA C, Bang C, etc., and these high-level programming languages designed for specific hardware can serve as backends.
  • TAL can be formed based on CUDA C, Bang C and other programming languages designed for specific hardware, so as to facilitate users to edit. This helps to maximize the performance of the hardware.
  • FIGS. 6a to 6d show schematic diagrams of structures of various neural network accelerators/processors.
  • the neural network accelerator in FIG. 6a is Cambricon-ACC, which includes an I/O interface circuit, a controller, a vector SPM memory, a matrix SPM memory, a vector functional unit VFU, and a matrix functional unit MFU.
  • the I/O interface circuit is connected to the controller, the SFU, the vector SPM, and the matrix SPM, while the vector SPM is connected to the VFU and the matrix SPM is connected to the MFU.
  • the vector SPM is used to store vector data
  • the matrix SPM is used to store the matrix data
  • the VFU accesses the vector data in the vector SPM
  • the MFU accesses the matrix data in the matrix SPM.
  • Figure 6b is a schematic diagram of a multilayer structure of Figure 6a.
  • the neural network accelerator includes an I/O interface circuit, a controller, a cluster memory, and a plurality of parallel computing components; each computing component includes a plurality of processing units P0-Pn, and each processing unit includes the vector SPM memory, matrix SPM memory, vector functional unit VFU, and matrix functional unit MFU shown in Figure 6a.
  • Multiple parallel computing components are connected to the cluster memory, and the cluster memory is connected to the controller.
  • the computing components, the cluster memory, and the controller perform data access through the I/O interface circuit.
  • FIG. 6c shows the structure of a tensor processing unit TPU.
  • the TPU includes an I/O interface circuit, a controller, a unified buffer (Unified Buffer), a weight first-in-first-out memory (Weight FIFO), and an arithmetic component.
  • the I/O interface circuit is connected to the controller, the unified buffer, and the weight FIFO memory.
  • the arithmetic components include an MMU, activation components, normalization/pooling components, etc., which are connected to the unified buffer and the weight FIFO memory in order to access these memories.
  • Figure 6d shows the structure of a GPU-TC.
  • the GPU includes an I/O interface circuit, a controller, and a plurality of computing components.
  • the I/O interface circuit is connected with the controller and each arithmetic component, and the controller is connected with each arithmetic component.
  • Each computing component includes a shared memory and a plurality of tensor processing cores G0-Gn connected to the shared memory.
  • FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., to form a multi-dimensional data processor with common features.
  • the virtual processor 320 may include an I/O interface circuit 3210, a control circuit 3220 and an operation component 3230, and the operation component may include a first storage circuit 3231 and an operation circuit 3233;
  • the I/O interface circuit 3210 may be configured for input and output of the virtual processor 320;
  • the control circuit 3220 may be configured to perform access operations through the I/O interface circuit 3210;
  • the first storage circuit 3231 may be configured to read at least input data and weight data through the I/O interface circuit 3210;
  • the operation circuit 3233 may be configured to read the input data and weight data from the first storage circuit 3231 for operation.
  • control circuit 3220 is connected to the I/O interface circuit 3210 and the arithmetic component 3230 to control the I/O interface circuit 3210 and the arithmetic component 3230.
  • the input and output of the virtual processor 320 can be any input and output of weight data, input data, intermediate data, instructions, code, etc.
  • the operation circuit 3233 accesses the first storage circuit 3231 and reads the required content therefrom, and stores the calculated content into the first storage circuit 3231 .
  • Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure.
  • the first storage circuit 3231 may include: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data; the operation circuit 3233 may include a parallel functional unit PFU for performing operations on non-scalar data;
  • the I/O interface circuit 3210 may be connected to the control circuit 3220, PWM and the PNM;
  • the control circuit 3220 is connected to the PWM, the PNM, and the PFU; both the PWM and the PNM are connected to the PFU.
  • the input data and weight data for the neural network operation can be stored in PNM and PWM respectively, so as to facilitate the access of the operation circuit 3233;
  • the parallel functional unit PFU can extract the required data from the PNM and PWM , and can handle vector data, matrix data, and scalar data, as well as higher-dimensional data.
  • the virtual processor of the present disclosure further includes a scalar functional unit SFU; the SFU is connected to the control circuit and the I/O interface circuit, and is configured to operate on scalar data.
  • FIG. 7b introduces the structure and function of the single-layer processor; the structure and function of the multi-layer processor are described below with reference to FIG. 7c.
  • the virtual processor further includes a shared memory circuit PSM, and the number of the arithmetic components is multiple; the PSM is configured to read input data and weight data through the I/O interface circuit ; the plurality of operation components are connected in parallel to the PSM, and are configured to read the input data and weight data from the shared memory circuit for operation.
  • the shared memory circuit PSM is connected to multiple operation components, the input data, weight data, etc. required by the operation components are first stored in the PSM, and then these operation components obtain these data from the PSM.
  • the PSM is visible to the user, and the user can explicitly manage the PSM.
  • the single-layer TAM structure shown in Fig. 7b can be mapped to the single-layer structures shown in Fig. 6a and Fig. 6c; the two-layer TAM structure shown in Fig. 7c can be mapped to the two-layer structures shown in Fig. 6b and Fig. 6d.
  • the structures shown in Figs. 7a to 7c are only an example, and the structures of any other neural network accelerators can also be abstracted.
  • the TAM does not remain unchanged, but can be changed according to the hardware structure to be abstracted.
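  • as a rough illustration (the data structures below are assumptions, not definitions from the patent), the TAM of Figures 7a to 7c could be modelled in C as nested structures: operation components with their PWM/PNM memories, clusters that share a PSM, and the virtual processor that holds the clusters:

```c
/* A minimal sketch (assumptions, not the patent's definition) of how the
 * tensor abstract machine (TAM) could be modelled as plain C data structures. */
#include <stddef.h>

typedef struct {                 /* one operation component (Fig. 7b)        */
    float *pwm;                  /* parallel weight memory                   */
    float *pnm;                  /* parallel neuron memory                   */
    size_t pwm_size, pnm_size;   /* capacities in elements                   */
} OperationComponent;            /* the PFU reads PWM/PNM and writes PNM     */

typedef struct {                 /* two-layer structure (Fig. 7c)            */
    float *psm;                  /* shared memory, visible to the user       */
    size_t psm_size;
    OperationComponent *components;
    size_t num_components;       /* components connected in parallel to PSM  */
} ProcessingCluster;

typedef struct {                 /* the virtual processor 320                */
    ProcessingCluster *clusters;
    size_t num_clusters;
    /* the control circuit and I/O interface circuit are implicit here: the
     * host issues space-allocation, memory-access, and operation instructions */
} TensorAbstractMachine;
```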
  • the control circuit reads the program instructions from the external storage DRAM;
  • the PFU reads input data and weight data from the internal memories PNM and PWM, completes a vector operation, and writes the intermediate result back to the PNM;
  • the process of converting Graph IR to TIR may include: splitting the multidimensional data of the second intermediate representation into at least one sub-multidimensional data; determining the storage space of each sub-multidimensional data according to its category (the categories include input data and weights) and the operations that each sub-multidimensional data needs to participate in; and generating space allocation instructions, memory access instructions, and operation instructions related to the sub-multidimensional data.
  • the space allocation instruction, memory access instruction and operation instruction related to the sub-multidimensional data are the instructions defined by the first intermediate representation.
  • Figure 8a shows an exemplary code diagram of the first intermediate representation TIR.
  • Tensor(fp32)<NCHW>(1,3,224,224) x — the data type of x is float32, the format is NCHW, and the size is (1,3,224,224);
  • Tensor(fp32)<NCHW>(64,3,3,3) y — the data type of y is float32, the format is NCHW, and the size is (64,3,3,3);
  • Tensor(fp32)<NCHW>(1,64,224,224) Temp — the data type of Temp is float32, the format is NCHW, and the size is (1,64,224,224).
  • TIR also provides space allocation instructions related to the storage space allocation of the input data x, weight data weight, temporary data Temp, bias data y, result data Result, etc. (for example, the allocate instruction shown in Figure 8a);
  • TIR also provides memory access instructions to load data from off-chip memory (such as GDRAM) to on-chip memory, such as the instruction load x.gdram to x.pnm; TIR also provides operation instructions describing the operations to be performed.
  • Figure 8b shows a schematic diagram of converting the second intermediate representation to the first intermediate representation according to an embodiment of the present disclosure.
  • converting the second intermediate representation into the first intermediate representation may include: in operation S810, splitting the multidimensional data of the second intermediate representation into a plurality of sub-multidimensional data according to the size of the storage basic block BBM (Building Block of Memory), and generating space allocation instructions and memory access instructions, so that corresponding storage space is applied for in the shared memory circuit PSM according to the space allocation instructions and the plurality of sub-multidimensional data are loaded from external memory into the shared memory circuit PSM according to the memory access instructions; and in operation S820, splitting the sub-multidimensional data in the PSM according to the computation basic block BBC (Building Block of Computation) and generating the corresponding space allocation instructions (such as the allocate shown in Figure 8a).
  • Figure 8b shows the flow chart of GIR conversion to TIR.
  • two basic blocks can be defined, namely, the storage basic block and the calculation basic block; wherein, the storage basic block BBM can be the smallest granularity of data copy,
  • the size of the storage basic block is determined by the shared memory circuit PSM;
  • the computing basic block can be the smallest granularity of the operation, and the size of the computing basic block is also limited by the constraints of PNM, PWM and the computing instruction of the current operation.
  • the compiler can split the data transfer from off-chip memory (such as DRAM) to the PSM according to the original data and the size of the storage basic block; that is, for an original input tensor A residing in DRAM, only one BBM-sized block is loaded into the PSM at a time, so the number of loads is tensor A/BBM.
  • the compiler can split the PSM→PWM and PSM→PNM transfers according to the storage basic block BBM and the computation basic block BBC.
  • the operation can be completed through the operation instructions defined by the TIR.
  • the result after the PFU calculation is completed is stored in the PSM according to the current split position, and then the PSM is stored in the DRAM according to the current split position.
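  • a minimal sketch of the split scheme just described, simplified to one dimension and using assumed helper names (copy_block and pfu_compute are hypothetical stand-ins for TIR load/store and operation instructions), might look as follows; it assumes total is a multiple of BBM and BBM a multiple of BBC:

```c
#include <stddef.h>
#include <string.h>

/* hypothetical data movement standing in for TIR load/store instructions */
static void copy_block(float *dst, const float *src, size_t len) {
    memcpy(dst, src, len * sizeof(float));
}

/* hypothetical PFU operation on one computation basic block (BBC) */
static void pfu_compute(const float *pnm, const float *pwm,
                        float *pnm_out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        pnm_out[i] = pnm[i] * pwm[i];
}

void tiled_like_tir(const float *dram_in, float *dram_out, size_t total,
                    float *psm, float *pnm, const float *pwm, float *pnm_out,
                    size_t BBM, size_t BBC) {
    for (size_t m = 0; m < total; m += BBM) {        /* total/BBM loads       */
        copy_block(psm, dram_in + m, BBM);           /* DRAM -> PSM (one BBM) */
        for (size_t c = 0; c < BBM; c += BBC) {      /* split inside the PSM  */
            copy_block(pnm, psm + c, BBC);           /* PSM -> PNM (one BBC)  */
            pfu_compute(pnm, pwm, pnm_out, BBC);     /* operation on the PFU  */
            copy_block(psm + c, pnm_out, BBC);       /* result back to PSM    */
        }
        copy_block(dram_out + m, psm, BBM);          /* PSM -> DRAM           */
    }
}
```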
  • Figures 8a and 8b are only used to illustrate the conversion process of the intermediate representation, and the instructions contained in the above-mentioned intermediate representation are only a form of intermediate representation, not hardware instructions that can be executed by the processor.
  • the operation process shown in the flowchart of FIG. 8b is only used to illustrate the role of the intermediate representation, and is not used to limit the specific operation process.
  • a processor needs to convert the intermediate representation shown in Figure 8a into specific machine instructions, and the processor can implement the above-mentioned operation process according to the specific machine instructions.
  • the compiler may further optimize the first intermediate representation to generate a target program according to the optimized first intermediate representation.
  • the optimization of the first intermediate representation is described in detail below.
  • the optimization scheme according to the embodiment of the present disclosure is given as follows.
  • optimizing the first intermediate representation by the compiler may include: converting the first dimension order of the multidimensional data to the second dimension order to adapt to the corresponding neural network accelerator.
  • the above optimizations are mainly for GPU operations. Since the tensor data in TensorFlow is in NHWC format by default, while it is more efficient to use NCHW on the GPU, two transformation nodes can be used during optimization, namely an NHWC-to-NCHW transformation node and an NCHW-to-NHWC transformation node; consecutive NHWC-to-NCHW and NCHW-to-NHWC transformation nodes occurring between two consecutive GPU compute nodes can cancel each other out.
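  • a possible form of this cancellation pass, written here only as an assumed sketch over a flat sequence of operation tags (not the patent's implementation), is:

```c
/* A minimal sketch: an NCHW->NHWC node immediately followed by an
 * NHWC->NCHW node (or vice versa) is removed from the op sequence. */
#include <stddef.h>

typedef enum { OP_COMPUTE, OP_NHWC_TO_NCHW, OP_NCHW_TO_NHWC } OpKind;

static int cancels(OpKind a, OpKind b) {
    return (a == OP_NHWC_TO_NCHW && b == OP_NCHW_TO_NHWC) ||
           (a == OP_NCHW_TO_NHWC && b == OP_NHWC_TO_NCHW);
}

/* removes cancelling pairs in place and returns the new length */
size_t cancel_transposes(OpKind *ops, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        if (out > 0 && cancels(ops[out - 1], ops[i]))
            --out;                 /* drop both nodes of the inverse pair */
        else
            ops[out++] = ops[i];
    }
    return out;
}
```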
  • optimizing the first intermediate representation by the compiler may further include: performing operator fusion on the first operator and the second operator.
  • operator fusion can also be called layer fusion, which can fuse multiple layers in the neural network to compile and generate instructions, reduce the number of accesses to off-chip memory, and thus improve data throughput.
  • the calculation process of the first operator and the second operator is as follows:
  • the intermediate results between the two operators are stored in the on-chip memory, so that the second operator can read data from the on-chip memory.
  • Figures 9a and 9b are schematic diagrams showing traditional neural network operations and a neural network after operator fusion.
  • the PFU operation reads the input data and weight data from PNM and PWM to complete the operation, and writes the result of the first operator back to the PNM;
  • PFU reads data from PNM and PWM to complete the operation, and writes the result of the second operator back to PNM;
  • in the traditional approach, the intermediate result between two operators needs to be stored in off-chip memory, and when the next operator performs an operation, the intermediate result needs to be read back from the off-chip memory; this causes every operation to read data from the off-chip memory, which obviously reduces the operation speed of the entire neural network, and the data access to the off-chip memory is also likely to become a bottleneck for the operation speed of the neural network.
  • the intermediate result can be stored in the on-chip memory (eg, SRAM), thereby improving the data access speed.
  • the first processor performing operator fusion on the first operator and the second operator includes: splitting the first weight data to form a plurality of first sub-weight data, so that the first sub-operation result of the first input data and each first sub-weight data is smaller than the capacity of the PNM; and operating on the first input data and the plurality of first sub-weight data in turn, and each time a first sub-operation result is obtained, storing the first sub-operation result in the PNM, so that the second operator can operate on the first sub-operation result to obtain a second sub-operation result.
  • FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure.
  • the far left shows the input data
  • the intermediate result in Figure 10 is assumed to be larger than the capacity of the PNM, so it will not be possible to store the intermediate result in the on-chip memory.
  • the weight data may be split, for example, into weight data 1 and weight data 2 .
  • Weight data 1 is represented by light squares
  • weight data 2 is represented by dark squares.
  • the input data is first subjected to a convolution operation (first operator) with the weight data 1, as shown by the route 1 in FIG. 10 .
  • the intermediate data generated by the convolution operation (shown as the light-colored square in the intermediate result) is stored in on-chip memory.
  • the second operator reads the stored intermediate data from the on-chip memory and performs operations (as shown in route 2), and the obtained output result is stored in the off-chip memory.
  • a convolution operation is performed on the input data and the weight data 2, as shown in the route 3 of FIG. 10 .
  • the intermediate data generated by the convolution operation (shown as dark squares in the intermediate result) is stored in on-chip memory.
  • the second operator reads the stored intermediate data from the on-chip memory and performs operations (as shown in route 4), and the obtained output result is stored in the off-chip memory.
  • the intermediate result is not generated at one time, but is formed by dividing the weight data into blocks.
  • Each generated intermediate result can be stored in the on-chip memory, and the second operator does not need to read data from the off-chip memory, thus reducing the number of accesses to the off-chip memory and improving the operation efficiency.
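  • a minimal sketch of this fusion scheme, reduced to one dimension with assumed stand-ins for the two operators (op1_chunk for the first operator and a ReLU-like op2 for the second), might look like this; pnm plays the role of the on-chip buffer:

```c
#include <stddef.h>

/* toy stand-in for the first operator (e.g. a convolution) on one weight chunk */
static void op1_chunk(const float *input, const float *w_chunk,
                      float *out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        out[i] = input[i] * w_chunk[i];
}

/* toy stand-in for the second operator (e.g. a ReLU-like activation) */
static void op2(const float *in, float *out, size_t len) {
    for (size_t i = 0; i < len; ++i)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

/* weights are split into chunks no larger than the PNM, so every partial
 * result of the first operator stays on chip and feeds the second operator
 * directly; only the final output chunks go back to off-chip memory */
void fused_operators(const float *input, const float *weights, size_t w_len,
                     float *pnm, size_t pnm_capacity, float *output) {
    for (size_t o = 0; o < w_len; o += pnm_capacity) {
        size_t len = (o + pnm_capacity <= w_len) ? pnm_capacity : w_len - o;
        op1_chunk(input + o, weights + o, pnm, len);  /* operator 1 -> PNM (on chip) */
        op2(pnm, output + o, len);                    /* operator 2 reads the PNM    */
    }
}
```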
  • Fig. 11 shows a data storage manner when there are multiple PSMs according to an embodiment of the present disclosure
  • Figs. 12a-12d show schematic diagrams of data rotation when multiple PSMs store data.
  • when there are multiple PSMs, the first processor is further configured to: in operation S1110, store multiple sets of weight data in the multiple PSMs in rotation; in operation S1120, operate on the weight data in the plurality of PSMs at every rotation; and in operation S1130, after the rotation of the weight data through all the PSMs is completed, read new weight data into the plurality of PSMs.
  • the memory of the PFU (eg PNM and PWM) is relatively close to the computational unit, so this memory needs to be used carefully to avoid execution pipeline stalls. More specifically, the programmer needs to calculate the size of the on-chip buffer required for the computation. If the size required for one calculation exceeds the size of the PFU memory, the compilation process will stop with an error message.
  • a method to improve the utilization of the PWM is given below: by keeping the weight data resident in the PWM, multiple convolution operations can be performed on different parts of the input data. Compared with the traditional method, in which the off-chip memory needs to be accessed continuously, the efficiency of the method of the present disclosure is improved by a factor of 1.6.
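  • as a simplified, assumed illustration of this reuse (again one-dimensional and not the patent's code), the weights are copied into the PWM once and then reused for every input tile:

```c
#include <stddef.h>
#include <string.h>

/* 1-D "valid" convolution of one input tile with the weights resident in PWM */
static void conv_tile(const float *tile, size_t tile_len,
                      const float *pwm, size_t w_len, float *out) {
    for (size_t i = 0; i + w_len <= tile_len; ++i) {
        out[i] = 0.0f;
        for (size_t k = 0; k < w_len; ++k)
            out[i] += tile[i + k] * pwm[k];
    }
}

/* load the weights into the PWM once, then reuse them for every input tile,
 * instead of re-reading the weights from off-chip memory for each tile */
void conv_with_resident_weights(const float *input, size_t num_tiles, size_t tile_len,
                                const float *weights_dram, size_t w_len,
                                float *pwm, float *output) {
    size_t out_per_tile = tile_len - w_len + 1;
    memcpy(pwm, weights_dram, w_len * sizeof(float));
    for (size_t t = 0; t < num_tiles; ++t)
        conv_tile(input + t * tile_len, tile_len, pwm, w_len,
                  output + t * out_per_tile);
}
```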
  • each PSM memory can read corresponding data, such as weight data, from off-chip memory (eg, DRAM).
  • four PSMs are shown, namely PSM0, PSM1, PSM2, and PSM3; these PSMs can store four sets of weight data, respectively denoted weight data A, weight data B, weight data C, and weight data D.
  • weight data A is stored in PSM0
  • weight data B is stored in PSM1
  • weight data C is stored in PSM2
  • weight data D is stored in PSM3.
  • the four groups of weight data can be stored in the PSMs in rotation.
  • the weight data A is transferred from PSM0 to PSM1
  • the weight data B is transferred from PSM1 to PSM2
  • the weight data C is transferred from PSM2 to PSM3
  • the weight data D is transferred from PSM3 to PSM0.
  • after the above-mentioned weight data D, weight data A, weight data B, and weight data C are calculated, the next rotation is performed.
  • weight data A is transferred from PSM1 to PSM2
  • weight data B is transferred from PSM2 to PSM3
  • weight data C is transferred from PSM3 to PSM0
  • weight data D is transferred from PSM0 to PSM1.
  • weight data A is transferred from PSM2 to PSM3
  • weight data B is transferred from PSM3 to PSM0
  • weight data C is transferred from PSM0 to PSM1
  • weight data D is transferred from PSM1 to PSM2.
  • the user can implement the approach shown in Figures 11 and 12a-12d by editing the TAL, which partially compensates for the gap between the access latency of the PFU memory (e.g. 10 clock cycles) and that of the off-chip memory DRAM (e.g. 300 clock cycles).
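  • the rotation schedule of Figures 11 and 12a-12d can be illustrated with the following self-contained C sketch (an assumption about one possible schedule, printing which weight group each PSM computes with in each round):

```c
#include <stdio.h>

#define NUM_PSM 4

int main(void) {
    const char groups[NUM_PSM] = { 'A', 'B', 'C', 'D' };  /* weight data A..D       */
    int in_psm[NUM_PSM] = { 0, 1, 2, 3 };                 /* PSM i holds group in_psm[i] */

    for (int round = 0; round < NUM_PSM; ++round) {
        /* at every rotation, the weight data currently in each PSM is operated on */
        for (int p = 0; p < NUM_PSM; ++p)
            printf("round %d: PSM%d computes with weight data %c\n",
                   round, p, groups[in_psm[p]]);
        /* rotate: the group in PSMi moves to PSM(i+1), the last wraps to PSM0 */
        int last = in_psm[NUM_PSM - 1];
        for (int p = NUM_PSM - 1; p > 0; --p)
            in_psm[p] = in_psm[p - 1];
        in_psm[0] = last;
    }
    /* after the rotation through all PSMs completes, new weight data
       would be read from off-chip memory into the PSMs */
    return 0;
}
```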
  • processing core synchronization, processing cluster synchronization, and/or chip synchronization can be performed at the TAL; and/or parallel tasks are mapped to parallel processing clusters in virtual processors.
  • control logic of this embodiment can also be realized by editing the TAL by the user.
  • the main purpose of exposing control logic at the TAL level is to provide functional correctness and execution efficiency, the main feature of which is the synchronization and parallelization of a large number of computational units.
  • Processing core synchronization is to ensure the correctness of parallel execution of pipelines of different functional units such as scalars, vectors, and matrices.
  • the processing cluster includes multiple processing cores, and the processing cluster synchronization is to maintain the synchronization of all processors in a specific cluster in the same space.
  • Chip synchronization ensures that all clusters continue to execute only after all clusters have reached the synchronization point. It should be noted that processing core synchronization can be hidden from the user to simplify the programming burden of the programmer.
  • One potential optimization method is software pipelining, which can be used to hide the latency of memory accesses.
  • Such a task can be mapped to 2 processing clusters in TAM i.e. processing cluster 0 and processing cluster 1, each processing cluster has 4 processor cores i.e. processing core 0, processing core 1, processing core 2 and processing core 3.
  • TAM is used to abstract common features of multiple neural network accelerators, which extract various key and common features of tensor processing in various ML architectures. Therefore, when a task is mapped into a TAM, the TAM can be further mapped to specific hardware accelerators.
  • GPU-TC and Cambricon-ACC are taken as examples for illustration.
  • for GPU-TC, two SMs can be used; while each SM may include, for example, 8 tensor processing cores, in this task each streaming multiprocessor (SM) only needs to use 4 tensor processing cores.
  • each processing cluster has 4 processing cores, so in this task, each processing cluster can use 4 processing cores.
  • when the task requires more processing clusters or processing cores than are available, it can be run in a time-sharing manner, that is, the task can be divided into multiple executions.
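  • as a hypothetical illustration of this mapping and time-sharing (not the patent's scheduler), tasks can be assigned round-robin to 2 clusters × 4 cores, with any overflow pushed into later time slices:

```c
#include <stdio.h>

#define CLUSTERS 2
#define CORES_PER_CLUSTER 4

int main(void) {
    int num_tasks = 11;                               /* hypothetical task count */
    int slots = CLUSTERS * CORES_PER_CLUSTER;         /* 8 slots per time slice  */

    for (int t = 0; t < num_tasks; ++t) {
        int slice   = t / slots;                      /* which time-shared run   */
        int slot    = t % slots;
        int cluster = slot / CORES_PER_CLUSTER;
        int core    = slot % CORES_PER_CLUSTER;
        printf("task %2d -> time slice %d, processing cluster %d, core %d\n",
               t, slice, cluster, core);
    }
    return 0;
}
```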
  • control logic can also be implemented in TAL, so that users can customize it according to actual needs.
  • the generated code can be converted into an object program adapted to specific hardware.
  • the target program when the underlying hardware is MLU, the target program can be Bang C language to adapt to the MLU hardware; when the underlying hardware is GPU-TC, the target program can be CUDA C language. It can be understood that when the underlying hardware is other accelerators, the target program can be converted into machine instructions suitable for the accelerator.
  • the TAM can be instantiated for different specific platforms (such as MLU, TPU, GPU-TC) and is applicable to various accelerators, which significantly improves the portability of the system and facilitates porting to various accelerator hardware platforms.
  • to evaluate TIC, three types of machine learning computers are used, namely a GPU with tensor processing cores (GPU-TC), an MLU, and a TPU.
  • the neural network algorithms used for evaluation come from different application scenarios, including ResNet-50 and VGG16.
  • ResNet and VGG are not only used for image classification, but also as backbone networks for general feature extraction.
  • the first benchmark is the TVM stack, which directly supports GPU-TC, MLU, and TPU by rewriting tensor primitives.
  • the second benchmark is Glow IR, which includes two layers of IR: a high-level graph IR (mainly for graph optimizations) and a low-level instruction IR (mainly for memory-related optimizations).
  • the third benchmark is the TensorFlow framework. The above benchmarks were originally designed primarily for CPUs and GPUs (without tensor processing cores), but were modified according to the technical solutions of the present application to support GPU-TC, MLU, and TPU.
  • the experimental and comparative results mainly include three aspects: performance, efficiency and portability. It will be introduced in detail below.
  • FIG. 14 shows the performance of TIC’s technique and other benchmark techniques on GPU-TC, where execution latency is normalized to TensorFlow’s latency.
  • the average performance improvement of TIC is about 201%.
  • the main reason is that unnecessary architectural overhead is avoided and multiple optimizations are made at TAL.
  • the average performance gain over Glow is around 34.8% because Glow's backend is implemented directly through CUDA C rather than through a highly optimized library.
  • the average performance gain is about 13.7%.
  • the beneficial effects mainly come from two optimization processes, namely, the sequential optimization of data dimensions with the help of TIR and the optimization of operator fusion.
  • Figure 15 shows the performance of TIC's technique and other benchmark techniques on MLU.
  • the average performance of TIC is about 96.9% of TensorFlow's performance.
  • the main reason is that TensorFlow for MLU runs on a highly optimized library, and more optimization measures can be applied to the technical solution of TIC.
  • the performance improvement of the disclosed technical solution is about 41.4% compared to the performance of TensorFlow, because several customized optimizations are performed on the technology of TIC.
  • the performance gains are about 23.5% and 20.7%, respectively, which well demonstrates the efficiency of TIC's technology as a compilation architecture.
  • Figure 16 shows the performance of TIC's technique and other benchmark techniques on TPU; the original Glow cannot run on TPU-Lite. Due to the relatively coarse granularity of the considered TPU primitives, the optimization that can be performed on the technical solution of TIC is very limited. Therefore, the performance is relatively close for different implementations.
  • the efficiency can be evaluated from different perspectives. From the perspective of using programming architecture to build ML applications, the efficiency of TIC's technical solution is the same as other benchmarks due to the preservation of the programming interface. From the point of view of using TIR and TAL to build new operations, there is a significant increase in efficiency because the semantics of tensor data is preserved from the graph node to the TAM's hardware all the time.
  • Figure 17 compares the reduction in LoC when using the techniques of TVM and TIC to implement convolution operations on GPU-TC and MLU. It can be clearly seen that LoC drops by 43% and 38% on GPU-TC and MLU, respectively. From the perspective of using TAL to directly architect ML applications, developing applications using standard C/C++ can exhibit high efficiency. An obvious advantage is that many ready-to-use applications written in C/C++ can be directly converted to TAL without tensor-related optimizations.
  • Table 1 shows a comparison of the performance consistency achieved using TensorFlow, TVM and the TIC of the present disclosure. Quantitatively, the performance of the TIC of the present disclosure is improved by 25% and 15.4%, respectively, compared with TensorFlow and TVM.
  • TIC Tensor Preserving Compilation
  • the idea of building TIC is to preserve tensor semantics throughout the compilation process, i.e. from the upper-level programming interface to the lower-level intermediate representation and various languages, and even to the tensor-related instructions of the lower-level hardware platform.
  • the whole TIC architecture can preferably include three components, namely the tensor abstract machine module TAM, the tensor-aware language module TAL, and the tensor intermediate representation module TIR, which are mainly used to solve the problems of portability, performance, and efficiency, respectively.
  • TIC outperforms the state-of-the-art in performance, portability, and efficiency on GPU-TC, TPU and MLU.
  • Embodiments of the present disclosure also provide an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to perform the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the above-mentioned method and apparatus can also be implemented as a compiling apparatus, and the compiling apparatus can constitute a combined processing apparatus.
  • FIG. 18 shows a combined processing device 1800 , which includes the above-mentioned compiling device 1802 , a general interconnection interface 1804 , and other processing devices 1806 .
  • the compiling apparatus according to the present disclosure interacts with other processing apparatuses to jointly complete the operation specified by the user.
  • Figure 18 is a schematic diagram of a combined treatment device.
  • the compiling apparatus can be implemented in various ways such as software and hardware, and it can run on any one or more of general-purpose/special-purpose processors such as a CPU, a graphics processing unit (GPU), and a neural network processor.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses neural networks to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, to complete basic controls such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the compiling device obtains the required input data from other processing devices and writes it into the storage device on the compiling device chip; it can obtain control instructions from other processing devices and write them into the control cache on the compiling device chip; it can also read the data in the on-chip storage module of the compiling device and transmit it to other processing devices.
  • the structure may further include a storage device 1808, and the storage device is respectively connected to the compiling device and the other processing device.
  • the storage device is used to save the data in the compiling device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the compiling device or other processing devices.
  • the combined processing device can be used as an SoC (system on a chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption.
  • the general interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, a monitor, a mouse, a keyboard, a network card, or a Wi-Fi interface.
  • the present disclosure also discloses a chip, which includes the above-mentioned compiling apparatus or combined processing apparatus.
  • the present disclosure also discloses a board card, which includes the above-mentioned chip.
  • the above board card may also include other supporting components, including but not limited to: a storage device 1904, an interface device 1906, and a control device 1908.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the storage device may include groups of storage units 1910. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency: data is transferred on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM.
  • the storage device may include four groups of storage units. Each group of storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes a plurality of double-data-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 1912 (e.g., a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces, and the present disclosure does not limit the specific forms of the above-mentioned other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, optical, acoustic, magnetic, or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • Clause 1 A method of processing multidimensional data, comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics;
  • parsing the first intermediate representation, and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
  • Clause 2 The method of Clause 1, wherein the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
  • An abstract language representation is received and compiled into the first intermediate representation; wherein the abstract language representation is editable by a user.
  • Clause 4 The method of clause 3, wherein a second intermediate representation is received to form the abstract linguistic representation, the second intermediate representation comprising a graphically expressed intermediate representation.
  • the abstract language representation is converted into the target program.
  • Clause 6 The method of any of clauses 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation to the first intermediate representation;
  • the second intermediate representation includes a graphically expressed intermediate representation.
  • Parsing a neural network model file where the neural network model file includes operation nodes and topological connection relationships of the neural network;
  • the second intermediate representation is obtained according to the operation node and the topological connection relationship.
  • Clause 8 The method of any of clauses 1-7, further comprising: optimizing the first intermediate representation to generate a target program from the optimized first intermediate representation.
  • the I/O interface circuit is configured for input and output of the virtual processor
  • the control circuit is configured to perform an access operation through the I/O interface circuit
  • the first storage circuit is configured to read at least input data and weight data through the I/O interface circuit
  • the operation circuit is configured to read the input data and weight data from the first storage circuit for operation.
  • the first storage circuit includes: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data;
  • the operation circuit includes a parallel functional unit PFU for performing operations on non-scalar data
  • the I/O interface circuit is connected to the control circuit, the PWM and the PNM; the control circuit is connected to the PWM, the PNM and the PFU; the PWM and the PNM are both connected to the PFU.
  • Clause 12 The method of clause 10 or 11, wherein the virtual processor further comprises a scalar functional unit SFU, the SFU connected to the control circuit and the I/O interface circuit and configured to perform operations on scalar data.
  • Clause 13 The method of any one of clauses 10-12, wherein the virtual processor further comprises a shared memory circuit PSM, and the number of the arithmetic components is plural;
  • the PSM is configured to read input data and weight data through the I/O interface circuit
  • the plurality of operation components are connected in parallel to the PSM and are configured to read the input data and weight data from the shared memory circuit for operation.
  • the multi-dimensional data of the second intermediate representation is divided into a plurality of pieces of sub-multi-dimensional data according to the size of the storage basic block BBM, and the plurality of pieces of sub-multi-dimensional data are transferred from the off-chip memory into the shared storage circuit PSM in multiple passes;
  • the intermediate results are stored in the off-chip memory.
  • Clause 15 An electronic device, comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method of any of clauses 1-14.
  • Clause 16 A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any of clauses 1-14.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Neurology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A device and method for processing multi-dimensional data, and an electronic device and a compilation apparatus (1802). The compilation apparatus (1802) may be comprised in a combined processing apparatus (1800). The combined processing apparatus (1800) may also comprise a universal interconnection interface (1804) and other processing apparatuses (1806). The compilation apparatus (1802) interacts with the other processing apparatuses (1806), so as to jointly complete a user-specified computing operation. The combined processing apparatus (1800) may further comprise a storage apparatus (1808). The storage apparatus (1808) is respectively connected to the compilation apparatus (1802) and the other processing apparatuses (1806), and is used for storing data of the compilation apparatus (1802) and the other processing apparatuses (1806).

Description

A device, method and computer program product for processing multidimensional data
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 2020111120905, filed on October 16, 2020 and entitled "A device, method and computer program product for processing multidimensional data".
Technical Field
The present disclosure relates to the field of computers, and more particularly, to the field of processing multidimensional data.
Background
Computers equipped with accelerators for increased efficiency have received increasing attention. To take advantage of such computers, there is a huge need for advanced programming architectures to achieve high performance, improve software productivity, and ensure better portability across highly diverse ML architectures. Although existing programming tools (such as TensorFlow and TVM) have realized the importance of tensor data for reducing the burden of programming, these programming tools still do not solve the above problems well. For example, in the prior art, loop statements are generally used to split tensor data into scalar data. In this case, however, the semantics of the tensors are more or less broken during the lowering process, thus losing potential opportunities for optimization.
SUMMARY OF THE INVENTION
The purpose of the present disclosure is to solve the problem that, in the prior art, tensor data needs to be split into scalar data, and to provide a method and device capable of retaining tensor primitives throughout the entire processing process.
According to a first aspect of the present disclosure, there is provided a method for processing multidimensional data, comprising: receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
According to a second aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
In the present disclosure, the proposed Tensor Intact Compiling (TIC) architecture can improve performance, efficiency, and portability.
Description of the Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, in which:
FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure;
FIG. 2a shows a flowchart of a method according to an embodiment of the present disclosure;
FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in FIG. 2a;
FIG. 3 shows a schematic diagram of a GIR according to an embodiment of the present disclosure;
FIG. 4a shows a device for processing multidimensional data according to an embodiment of the present disclosure; FIG. 4b shows a flowchart of the steps performed by the first processing device according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to another embodiment of the present disclosure;
FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors;
FIG. 7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs and the like to form a multidimensional data processor with common features;
FIG. 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure;
FIG. 7c shows the structure of a multi-layer processor and its functions;
FIG. 8a shows an exemplary code diagram of the second intermediate representation TIR;
FIG. 8b shows a schematic diagram of converting the second intermediate representation into the first intermediate representation according to an embodiment of the present disclosure;
FIGS. 9a and 9b show schematic diagrams of a traditional neural network operation and of the neural network after operator fusion;
FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure;
FIG. 11 shows the manner in which data is stored when there are multiple PSMs according to an embodiment of the present disclosure;
FIGS. 12a to 12d show schematic diagrams of the rotation of data when multiple PSMs store data;
FIGS. 13a and 13b depict schematic diagrams of mapping parallel tasks to parallel processing clusters in the virtual processor;
FIG. 14 shows the performance of the TIC technique and other baseline techniques on GPU-TC;
FIG. 15 shows the performance of the TIC technique and other baseline techniques on MLU;
FIG. 16 shows the performance of the TIC technique and other baseline techniques on TPU;
FIG. 17 compares the reduction in LoC when using the TVM and TIC techniques to implement convolution operations on GPU-TC and MLU;
FIG. 18 shows a combined processing device; and
FIG. 19 shows an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third" and "fourth" in the claims, description and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "comprising" and "including" as used in the description and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly dictates otherwise. It should be further understood that, as used in the description and claims of the present disclosure, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and in the claims, the term "if" may be contextually interpreted as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
A variety of programming languages have been developed for traditional general-purpose computing platforms, including low-level assembly languages for specific hardware architectures (such as x86, ARM, and RISC-V assembly languages), high-level languages that are convenient for users to program in, and logic programming languages oriented towards logical reasoning, such as Prolog. These programming languages face many problems on the intelligent computing systems represented by deep learning processors. There are three gaps between traditional programming languages and intelligent computing systems: first, the semantic gap, in which traditional programming languages have difficulty describing high-level intelligent computing semantics efficiently, resulting in low development efficiency of intelligent applications; second, the hardware gap, in which traditional programming languages have difficulty abstracting the hardware features of intelligent computers efficiently, resulting in low execution efficiency of the finally generated code; and third, the platform gap, in which the types of intelligent computing hardware platforms are numerous and constantly growing, and programs optimized for a specific platform are difficult to port across platforms.
FIG. 1 shows a flowchart of a method for processing multidimensional data according to an embodiment of the present disclosure. The method of the embodiments of the present application may be applied to a processor on which a compiler, a compiling component, or a compiling program may run. The compiler, compiling component, or compiling program may be used to perform at least one step of the method.
As shown in FIG. 1, in order to reduce or eliminate at least one of the above gaps, the method may include: in operation S110, receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and in operation S130, parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves multidimensional data semantics.
As mentioned above, the multidimensional data of the present disclosure may include non-scalar data such as vector data, matrix data, and tensor data, and may also be any other data of higher dimensions. The technical solutions of the present disclosure can also process scalar data, which will be described later. It should be understood, however, that the following description mainly takes tensor data as an example.
In the above, the multidimensional data keeps its multidimensional semantics all the way from reception to parsing, instead of being split into loops over multiple scalar data as in the traditional technology. Therefore, in the solution of the present disclosure, there is no need to first split the multidimensional data into scalar data and then recombine the scalar data into multidimensional data, which reduces intermediate conversions and improves computing efficiency. In addition, since the multidimensional data is always maintained, it is more intuitive for the user, which also makes it convenient for the user to edit the data and the above architecture. Furthermore, the elimination of the semantic gap further improves the efficiency of programming and computation.
Preserving the semantics of multidimensional data can be achieved by a corresponding programming language, for example a programming language with Conv semantics and a Tensor type. Generally, compared with traditional languages such as C++ and Python, a Tensor-type programming language can greatly reduce the amount of programming and preserve the semantics of multidimensional data (such as tensor data).
The target program described above may include a high-level language supported by the neural network hardware, such as the CUDA C and BANG C languages mentioned herein. These target programs can also process multidimensional data and preserve multidimensional data semantics.
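To make the flow of operations S110 and S130 concrete, the following Python fragment is only an illustrative sketch under stated assumptions: the TensorOp node type, its fields, and the emitter function are hypothetical and are not the actual first intermediate representation or code generator of the present disclosure. The point it shows is that each operation stays a whole-tensor node and is emitted as a single tensor-typed statement, never unrolled into scalar loops.

```python
# Minimal sketch (assumed node fields, not the patent's actual implementation):
# a tensor-level IR node is parsed and emitted directly as a tensor-typed target
# statement, so the multidimensional ("tensor") semantics survive into the output.
from dataclasses import dataclass

@dataclass
class TensorOp:              # hypothetical first-intermediate-representation node
    name: str                # e.g. "conv"
    inputs: list             # operand names, e.g. ["data", "kernel"]
    output: str              # result name
    attrs: dict              # operation info, data attributes, dimension order

def emit_target_program(ops):
    """Parse tensor-level IR nodes and emit tensor-typed target code as text."""
    lines = []
    for op in ops:
        args = ", ".join(op.inputs)
        # Each op is emitted as a whole-tensor call, never lowered to scalar loops.
        lines.append(f"{op.output} = {op.name}({args})  # attrs={op.attrs}")
    return "\n".join(lines)

ir = [TensorOp("conv", ["data", "kernel"], "tmp",
               {"layout": "NCHW", "dtype": "float16"}),
      TensorOp("add", ["tmp", "bias"], "out", {"dtype": "float16"})]
print(emit_target_program(ir))
```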
According to an embodiment of the present disclosure, the first intermediate representation includes one or more of the operation information, data attributes, and dimension order of the multidimensional data.
In the above, the operation information may refer to various kinds of information for operating on the data, such as the structure of the neural network, the operators in the neural network, data access operations, data optimization, and so on, which describe all operations related to processing and computing the data. The operation information of the multidimensional data here still contains the semantics of the multidimensional data. The structure of the neural network may refer to the relationships between each operator and the other operators in the neural network, the input and output relationships between operators, and so on, which describe the overall structure of the neural network. An operator in the neural network may be any information describing the operator, such as the type of the operator, or whether the operator is a single operator or a combination of multiple single operators.
The data attributes may describe the type of the data, such as a float type, a fix type, and so on. It should be understood that the above types are merely examples rather than limitations of the present disclosure.
The dimension order may be NHWC or NCHW. For images, N indicates how many images there are in the batch, H indicates how many pixels an image has in the vertical direction, W indicates the number of pixels in the horizontal direction, and C indicates the number of channels (for example, a black-and-white image has C=1 channel, while an RGB color image has C=3 channels).
NHWC has better memory-access locality (an output pixel can be obtained for every three input pixels), whereas NCHW must wait until the inputs of all channels are ready to obtain the final output, which requires a large temporary space.
In the present disclosure, a suitable format can be determined according to actual needs, such as the processing capability of the accelerator and the compatible dimension order.
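The practical difference between the two dimension orders can be illustrated with the flat-memory offset of a single element. The following sketch is only an illustration (the helper functions and the example shape are assumptions); it shows why NHWC keeps all channels of one pixel adjacent while NCHW does not.

```python
# Illustrative sketch only: flat-memory offsets of element (n, c, h, w) under the
# two dimension orders discussed above.
def offset_nchw(n, c, h, w, C, H, W):
    # innermost dimension is W, then H, then C
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # innermost dimension is C, so all channels of one pixel are adjacent
    return ((n * H + h) * W + w) * C + c

# Example: a tensor with (N, C, H, W) = (1, 3, 2, 2)
print(offset_nchw(0, 1, 0, 0, C=3, H=2, W=2))  # 4: channels of a pixel are far apart
print(offset_nhwc(0, 1, 0, 0, C=3, H=2, W=2))  # 1: channels of a pixel are adjacent
```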
FIG. 2a shows a flowchart of a method according to an embodiment of the present disclosure.
As shown in FIG. 2a, converting the first intermediate representation into a target program S130 includes: in operation S1310, converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and, in operation S1330, converting the abstract language representation into the target program.
FIG. 2b shows a schematic diagram of a tensor-preserving architecture according to the method shown in FIG. 2a. For ease of understanding and reading, the architecture of the present disclosure is called the Tensor Intact Compiling (TIC) architecture.
As shown in FIG. 2b, the architecture may include: machine learning applications A0, A1, …, AM-1; a framework; a graph intermediate representation GIR (Graph Intermediate Representation); a tensor intermediate representation module TIR (Tensor Intermediate Representation, the first intermediate representation described above); a tensor-aware language module TAL (Tensor Aware Language, the abstract language representation described above); a tensor abstract machine module TAM (Tensor Abstract Machine); back-end high-level languages, such as CUDA C, BANG C, XLA-TPU, and so on, where the above-mentioned target program may refer to such a back-end high-level language; and machine learning hardware H0, H1, …, HM-1.
The TIR is an intermediate representation designed to meet the needs of machine learning and capable of expressing scalar, vector, matrix, and tensor operations. Therefore, in addition to conventional scalar operations (such as arithmetic operations, logical operations, comparison operations, memory operations, function calls, and conditional operations), the TIR can also provide descriptions of vectors, matrices, and tensors, thereby preserving the semantics of these data.
According to an embodiment of the present disclosure, based on the architecture shown in FIG. 2b, the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graphically expressed intermediate representation.
The traditional intermediate representation may include a graph intermediate representation (Graph Intermediate Representation, GIR). For example, the graphically expressed intermediate representation may be the computation-graph intermediate representation obtained after parsing by the deep learning framework TensorFlow, or the computation-graph intermediate representation of the neural network compilation framework TVM, such as NNVM or Relay; these are only examples and are not intended to limit the scope of the present application. In the process of generating a target program from a traditional intermediate representation IR (such as a GIR), the original IR is usually split into intermediate representations of multiple scalar-computation loops, and the target program is then generated from those intermediate representations. It can be seen that, in the traditional method, the IR in tensor form first needs to be converted into scalar form, and the scalar form then needs to be converted into a target program that supports tensor semantics. This way of converting back and forth between tensors and scalars is verbose and error-prone. With the present invention, there is no need to convert tensor data into scalar data; instead, the semantics of tensors can be preserved. In addition, a large tensor operation can, for example, be converted into smaller tensor operations; compared with splitting tensor operations into scalar operations in a traditional IR, the TIR is more intuitive, which improves development efficiency for users. In the technical solution of the present disclosure, the semantics of tensors are always maintained and no conversion between tensors and scalars is needed, which improves the efficiency of code compilation and conversion.
FIG. 3 shows a schematic diagram of a GIR according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, the second intermediate representation can be obtained by: parsing a neural network model file, where the neural network model file includes the operation nodes and topological connection relationships of the neural network; and obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
The neural network model file described here may be a Json file, which records the structure, operators, and other information of the neural network; the details of the neural network can be obtained from the Json file.
Further, in FIG. 3, x is the input data, which undergoes a convolution operation with the weight data; the intermediate data generated by the convolution operation is added to the data y (the bias value), and the calculation result is finally obtained, where y is the bias data in the convolution operation. It should be understood that FIG. 3 is a simplified representation of a neural network, not a limitation on neural networks.
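As an illustration of this parsing step, the following sketch builds a small graph intermediate representation from a model file. The Json field names used here ("nodes", "op", "inputs") are assumptions for illustration only and do not correspond to the actual file format of any particular framework.

```python
# Illustrative sketch only: parsing operation nodes and their topological
# connections out of a Json model file into a (nodes, edges) graph.
import json

model_json = """
{ "nodes": [
    {"name": "conv1", "op": "conv", "inputs": ["x", "weight"]},
    {"name": "out",   "op": "add",  "inputs": ["conv1", "y"]}
]}
"""

def build_gir(text):
    """Return operator nodes and the topological connections between them."""
    parsed = json.loads(text)["nodes"]
    nodes = {n["name"]: n["op"] for n in parsed}
    edges = [(src, n["name"]) for n in parsed for src in n["inputs"]]
    return nodes, edges

nodes, edges = build_gir(model_json)
print(nodes)   # {'conv1': 'conv', 'out': 'add'}
print(edges)   # [('x', 'conv1'), ('weight', 'conv1'), ('conv1', 'out'), ('y', 'out')]
```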
The difference between the GIR and the TIR of the present disclosure is described below at the code level. The code in FIG. 4a and FIG. 4b shows the difference between the traditional intermediate representation and the TIR representation of the present disclosure.
As shown in FIG. 4a, a piece of tensor data is divided into multiple scalar data during computation and is expressed, for example, with for loops. In essence, in the prior art, tensor computation is divided into scalar computation, so that the semantics of the tensor data are lost. This places a heavy programming burden on the user and is computationally inefficient.
FIG. 4b shows an example of a TIR representation according to an embodiment of the present disclosure.
In FIG. 4b, the TIR representation includes a dimension order, which may be NCHW, and specifies the size of N "batch_size", the value of C "output_channel", the height "height", and the width "width". In addition, the first intermediate representation also includes the data "data" and the kernel "kernel", as well as their convolution operation information and the data type (such as float16).
It can be seen that, in the technical solution of the present disclosure, the semantics of the tensor data are preserved, and the shape and operation information of the tensor data are described, which simplifies the operations and improves programming efficiency.
It should be understood that the scheme of including tensor data semantics in FIG. 4b is applicable not only to tensor data but also to vector data and matrix data, where vector data is one-dimensional data and matrix data is two-dimensional data. The programming manner for preserving the semantics of multidimensional data is also not limited to the scheme shown in FIG. 4b; any other manner capable of preserving the semantics of multidimensional data is included within the scope of the present disclosure.
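The contrast between FIG. 4a and FIG. 4b can be sketched as follows. The function names and shapes are assumptions and the snippet is not the actual code of the figures; it simply shows how the same element-wise addition looks when lowered to scalar loops versus kept as one whole-tensor operation.

```python
# Illustrative contrast only: scalarized loops (FIG. 4a style) versus a single
# whole-tensor operation that preserves tensor semantics (FIG. 4b style).
import numpy as np

def add_scalarized(a, b):
    # FIG. 4a style: the tensor semantics disappear into nested scalar loops
    n, c, h, w = a.shape
    out = np.empty_like(a)
    for i in range(n):
        for j in range(c):
            for k in range(h):
                for l in range(w):
                    out[i, j, k, l] = a[i, j, k, l] + b[i, j, k, l]
    return out

def add_tensor(a, b):
    # FIG. 4b style: one tensor-level statement; shape and dtype stay visible
    return a + b

x = np.ones((1, 3, 4, 4), dtype=np.float16)  # NCHW: batch_size, output_channel, height, width
y = np.ones_like(x)
assert np.array_equal(add_scalarized(x, y), add_tensor(x, y))
```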
The TAL is an abstract language representation, which may be built as an extension based on the C language and takes into account the features of lower-level hardware (such as the on-chip memory hierarchy, control logic, and computing units). These hardware features will be described later when the TAM is introduced. The goal of the TAL is to provide users with alternatives so that modifications can be made according to the user's needs. The TAL can be converted into code for a target platform; such a target platform is characterized by having native tensor instructions, such as wmma for GPU-TC. The process of converting the TAL into code for the target platform may include: first converting the TAL into a target program, and then compiling the target program into machine instructions that the target platform can run.
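A minimal sketch of this lowering step is given below. The mapping table and the MLU intrinsic name are assumptions for illustration (only wmma is named in the text above), so this is not the actual lowering logic of the present disclosure.

```python
# Illustrative sketch only: a tensor-level operation is mapped to a native tensor
# instruction of the chosen target platform instead of being expanded into scalar code.
NATIVE_TENSOR_INSTRS = {
    ("matmul", "GPU-TC"): "wmma",        # tensor-core matrix multiply-accumulate
    ("conv",   "MLU"):    "bang_conv",   # hypothetical identifier for illustration
}

def lower(op_name, platform):
    intr = NATIVE_TENSOR_INSTRS.get((op_name, platform))
    if intr is None:
        raise NotImplementedError(f"no native tensor instruction for {op_name} on {platform}")
    return f"{intr}(...)  // emitted by the TAL-to-target step"

print(lower("matmul", "GPU-TC"))
```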
The TAM is a programming model provided to software-programming users and contains a basic abstraction of hardware accelerators. The TAM in the embodiments of the present application may abstract the common features of various neural network accelerators, extracting the key and common features of tensor processing in various machine learning architectures. Based on the TAM, hardware-aware optimizations can be performed at higher layers, and these features can even be exposed to the user. Since the TAM can be instantiated into different concrete platforms (such as GPU-TC), the portability of the system can be significantly improved.
In the present disclosure, the TIR, the TAL, and the TAM supporting the TAL can all process multidimensional data and can all recognize multidimensional semantics. Therefore, in the above conversion process, there is no need to convert multidimensional data such as tensor data into scalar data; instead, the work can be carried out directly in the context of multidimensional data, thereby reducing or eliminating the need to convert multidimensional data into scalar data and/or scalar data into multidimensional data.
It should be understood that the TAM is an abstraction of the common features of various neural network accelerators, so it can also be regarded as a specific neural network accelerator. Just as an ordinary neural network accelerator can run corresponding machine instructions, this specific neural network accelerator can run a corresponding target program. It should be understood that the target program here is a general term, which may be a user-editable high-level language. In the present disclosure, the TAM can correspond to specific neural network accelerator hardware, and the TAL can correspond to the above-mentioned target program.
Above the graph intermediate representation (GIR) in FIG. 2b there may be a framework; examples of frameworks include Caffe (a convolutional neural network framework), TensorFlow, MXNet, PyTorch, PaddlePaddle, and so on. Above the framework there may be various machine learning applications.
In the above structure, the TAM can support the running of the TAL, just as a hardware neural network accelerator supports a specific language; for example, GPU-TC supports CUDA C, MLU supports BANG C, and so on.
In the structure shown in FIG. 2b, below the TAL and the TAM are the specific hardware neural network accelerators and the target programs they support. In the example structure shown in FIG. 2b, the machine learning hardware may exemplarily include H0, H1, …, HM-1, and the languages supported by the machine learning hardware may include CUDA C, BANG C, TPU, and so on; the supported language depends on the specific hardware structure. In the structure shown in FIG. 2b, the abstract language representation TAL can easily run on the various architectures of the TAM.
The TIC architecture of one embodiment of the present disclosure has been described above in conjunction with FIGS. 2a to 4b. A TIC architecture according to another embodiment of the present disclosure is described below.
According to an embodiment of the present disclosure, converting the first intermediate representation into a target program S130 may include: receiving an abstract language representation, and compiling the abstract language representation into the first intermediate representation; wherein the abstract language representation is editable by a user.
FIG. 5 shows a schematic diagram of a tensor-preserving architecture according to the above method.
Similar to the architecture of FIG. 2b, the architecture shown in FIG. 5 may include: machine learning applications A0, A1, …, AM-1; a framework; a graph intermediate representation GIR; a tensor intermediate representation module TIR; a tensor-aware language module TAL; a tensor abstract machine module TAM; back-end high-level languages such as CUDA C, BANG C, XLA-TPU, and so on; and machine learning hardware H0, H1, …, HM-1.
According to an embodiment of the present disclosure, the abstract language representation may be formed by receiving a second intermediate representation, the second intermediate representation including a graphically expressed intermediate representation.
In FIG. 5, the multidimensional data from the GIR can be received to form the TAL, and the abstract language representation TAL is then converted into the first intermediate representation TIR, which differs from FIG. 2b, where the GIR is first converted into the TIR and then into the TAL. In the embodiment shown in FIG. 5, the information in the TIR can first be edited in the TAL to form a representation suitable for subsequent processing. For example, operators passed down from the GIR can be modified in the TAL.
According to a variation of the present disclosure, in FIG. 5, the user can directly edit new operators in the TAL without receiving existing operators from the GIR. The user can create new operators that are not in the framework according to his or her own needs. This makes the architecture provided by the present disclosure more flexible and adaptable.
According to yet another variation of the present disclosure, the TAL in FIG. 5 can also receive the multidimensional data in the framework, such as the operators in the framework, directly from the framework, without receiving operators from the GIR.
The TAL in FIG. 5 may also be built as an extension of the C language; it likewise takes into account the features of lower-level hardware (such as the on-chip memory hierarchy, control logic, and computing units). The goal of the TAL is to provide users with alternatives so that modifications can be made according to the user's needs, for example the user defining new operators or modifying existing operators to perform the desired optimization.
It should be understood that, similar to FIG. 2b, as shown in FIG. 5, the method of the present disclosure further includes: receiving a second intermediate representation, and converting the second intermediate representation into the first intermediate representation; wherein the second intermediate representation includes a graphically expressed intermediate representation.
In the embodiment of FIG. 5, for the graph intermediate representation, one part of the GIR may be converted into the TAL and then edited by the user, and another part may be converted into the TIR of the present disclosure. The difference between the GIR and the TIR has been described above in conjunction with FIGS. 4a and 4b and will not be repeated here. Alternatively, the graph intermediate representation GIR shown in FIG. 5 may be entirely converted into the TIR, and operators customized by the user in the TAL may also be converted into the TIR. It should be clear that the connection line between the GIR and the TAL in FIG. 5 only schematically expresses one possible implementation; this connection does not necessarily exist.
In the architectures of FIG. 2b and FIG. 5, the machine learning hardware may be various kinds of hardware, such as a GPU, an MLU, and so on. Each kind of hardware has its own programming language, such as CUDA C, BANG C, and so on, and these high-level programming languages designed for the hardware can serve as the back end. The TAL may be formed based on programming languages designed for specific hardware, such as CUDA C and BANG C, so as to facilitate editing by the user. This helps to maximize the performance of the hardware.
The structures of several neural network accelerators are introduced below to facilitate a more detailed description of the TAM later. FIGS. 6a to 6d show schematic structural diagrams of various neural network accelerators/processors.
FIG. 6a is a schematic structural diagram of the neural network accelerator Cambricon-ACC, which includes an I/O interface circuit, a controller, a vector SPM memory, a matrix SPM memory, a vector functional unit VFU, and a matrix functional unit MFU; in addition, in order to process scalar data, it also includes a scalar functional unit SFU. In the accelerator shown in FIG. 6a, the I/O interface is connected to the controller, the SFU, the vector SPM, and the matrix SPM, while the vector SPM is connected to the VFU and the matrix SPM is connected to the MFU. The vector SPM is used to store vector data and the matrix SPM is used to store matrix data; the VFU accesses vector data in the vector SPM, and the MFU accesses matrix data in the matrix SPM.
FIG. 6b is a schematic diagram of a multi-layer structure of FIG. 6a.
As shown in FIG. 6b, the neural network accelerator includes an I/O interface circuit, a controller, a cluster memory, and a plurality of parallel operation components. Each operation component includes a plurality of processing units P0-Pn, and each processing unit includes the vector SPM memory, matrix SPM memory, vector functional unit VFU, and matrix functional unit MFU shown in FIG. 6a. The plurality of parallel operation components are all connected to the cluster memory, and the cluster memory is connected to the controller. The operation components, the cluster memory, and the controller perform data access through the I/O interface circuit.
FIG. 6c shows the structure of a tensor processing unit TPU. In FIG. 6c, the TPU includes an I/O interface circuit, a controller, a unified buffer, a weight first-in-first-out memory (Weight FIFO), and an operation component. The I/O interface circuit is connected to the controller, the unified buffer, and the weight FIFO memory. The operation component includes an MMU, an activation component, a normalization/pooling component, and the like, and is connected to the unified buffer and the weight FIFO memory so as to access these memories.
FIG. 6d shows the structure of a GPU-TC. In FIG. 6d, the GPU includes an I/O interface circuit, a controller, and a plurality of operation components. The I/O interface circuit is connected to the controller and to each operation component, and the controller is connected to each operation component. Each operation component includes a shared memory and a plurality of tensor processing cores G0-Gn connected to the shared memory.
The structure of the virtual processor TAM obtained by abstracting the above accelerators/processors is described below.
图7a示出了根据本公开的一个实施方式的虚拟处理器的结构示意图,其对多种神经网络加速器/TPU/GPU等等进行了抽象,从而形成具有公共特征的多维数据处理器。7a shows a schematic structural diagram of a virtual processor according to an embodiment of the present disclosure, which abstracts various neural network accelerators/TPUs/GPUs, etc., to form a multi-dimensional data processor with common features.
如图7a所示,虚拟处理器320可以包括I/O接口电路3210、控制电路3220和运算组件3230,所述运算组件可以包括第一存储电路3231和运算电路3233;所述I/O接口电路3210可以配置用于所述虚拟处理器320的输入和输出;所述控制电路3220可配置为通过所述I/O接口电路3210进行存取操作;所述第一存储电路3231配置为通过所述I/O接口电路3210至少读取输入数据和权值数据;所述运算电路3233可配置为从所述第一存储电路3231中读取所述输入数据和权值数据以进行运算。在上文中,控制电路3220连接到I/O接口电路3210以及运算组件3230,以对I/O接口电路3210和运算组件3230进行控制,虚拟处理器320的输入和输出可以是权值数据、输入数据、中间数据、指令、代码等任何的输入和输出。相应的内容输入到第一存储电路3231之后,运算电路3233访问第一存储电路3231并从中读取所需的内容,并将经过运算的内容存入到该第一存储电路3231中。As shown in FIG. 7a, the virtual processor 320 may include an I/O interface circuit 3210, a control circuit 3220 and an operation component 3230, and the operation component may include a first storage circuit 3231 and an operation circuit 3233; the I/O interface circuit 3210 may be configured for input and output of the virtual processor 320; the control circuit 3220 may be configured to perform access operations through the I/O interface circuit 3210; the first storage circuit 3231 may be configured to perform an access operation through the The I/O interface circuit 3210 reads at least input data and weight data; the operation circuit 3233 may be configured to read the input data and weight data from the first storage circuit 3231 for operation. In the above, the control circuit 3220 is connected to the I/O interface circuit 3210 and the arithmetic component 3230 to control the I/O interface circuit 3210 and the arithmetic component 3230. The input and output of the virtual processor 320 can be weight data, input Any input and output of data, intermediate data, instructions, code, etc. After the corresponding content is input to the first storage circuit 3231 , the operation circuit 3233 accesses the first storage circuit 3231 and reads the required content therefrom, and stores the calculated content into the first storage circuit 3231 .
图7b示出了根据本公开的一个实施方式的虚拟处理器更进一步的示意图。Figure 7b shows a further schematic diagram of a virtual processor according to an embodiment of the present disclosure.
如图7b所示,所述第一存储电路3231可以包括:并行权值存储器PWM和并行神经元存储器PNM,所述PWM用于存储权值数据,所述PNM用于存储输入数据;所述运算电路3233可以包括并行功能单元PFU,用于对非标量数据进行运算;所述I/O接口电路3210可以与所述控制电路3220、PWM以及所述PNM连接;所述控制电路3220与所述PWM、所述PNM以及所述PFU连接;所述PWM和所述PNM均连接到所述PFU。As shown in FIG. 7b, the first storage circuit 3231 may include: a parallel weight memory PWM and a parallel neuron memory PNM, where the PWM is used to store weight data, and the PNM is used to store input data; the operation The circuit 3233 may include a parallel functional unit PFU for performing operations on non-scalar data; the I/O interface circuit 3210 may be connected to the control circuit 3220, PWM and the PNM; the control circuit 3220 is connected to the PWM , the PNM and the PFU are connected; both the PWM and the PNM are connected to the PFU.
在图7b中,进行神经网络运算的输入数据和权值数据可以分别存储在PNM和PWM中,以便于运算电路3233的访问;并行功能单元PFU可以从所述PNM和PWM中提取所需的数据,并可以处理向量数据、矩阵数据以及标量数据,也可以处理更高维度的数据。In Fig. 7b, the input data and weight data for the neural network operation can be stored in PNM and PWM respectively, so as to facilitate the access of the operation circuit 3233; the parallel functional unit PFU can extract the required data from the PNM and PWM , and can handle vector data, matrix data, and scalar data, as well as higher-dimensional data.
更进一步地,如图7b所示,为了处理标量数据,本公开的虚拟处理器进一步包括标量功能单元SFU,所述SFU连接到所述控制电路和所述I/O接口电路,配置为对标量数据进行运算。Further, as shown in FIG. 7b, in order to process scalar data, the virtual processor of the present disclosure further includes a scalar functional unit SFU, the SFU is connected to the control circuit and the I/O interface circuit, and is configured to process scalar data. data to operate.
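To make the cooperation between these components more concrete, a minimal C++ sketch of the single-core TAM abstraction is given below. The class name Tam, the buffer sizes, and the elementwise multiply-accumulate used in place of a real PFU instruction are assumptions introduced here purely for illustration and are not part of the disclosed hardware.
#include <cstddef>
#include <vector>

// Minimal model of the single-core TAM: two on-chip buffers (PNM and PWM),
// a parallel functional unit (PFU) for non-scalar work, and a scalar functional unit (SFU).
struct Tam {
    std::vector<float> pnm;   // parallel neuron memory: input and intermediate data
    std::vector<float> pwm;   // parallel weight memory: weight data
    Tam(std::size_t pnm_words, std::size_t pwm_words) : pnm(pnm_words), pwm(pwm_words) {}

    // PFU: operates on non-scalar data held in the PNM and the PWM (elementwise example).
    void pfu_multiply_accumulate(std::size_t n, float* out) {
        for (std::size_t i = 0; i < n; ++i) out[i] += pnm[i] * pwm[i];
    }

    // SFU: operates on scalar data (for example, adding a single bias term).
    float sfu_add(float a, float b) { return a + b; }
};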
In the above, FIG. 7b describes the structure and functions of the single-layer processor; the structure and functions of the multi-layer processor are described below with reference to FIG. 7c.
As shown in FIG. 7c, the virtual processor further includes a shared storage circuit PSM, and there are multiple operation components. The PSM is configured to read input data and weight data through the I/O interface circuit; the multiple operation components are connected to the PSM in parallel and are configured to read the input data and weight data from the shared storage circuit to perform operations.
In FIG. 7c, the shared storage circuit PSM is connected to multiple operation components. The input data, weight data, and other data required by the operation components are first stored in the PSM, and the operation components then fetch the data from the PSM. The PSM is visible to the user, and the user can manage it explicitly.
The single-layer TAM structure shown in FIG. 7b can be mapped to the single-layer structures shown in FIG. 6a and FIG. 6c; the two-layer TAM structure shown in FIG. 7c can be mapped to the two-layer structures shown in FIG. 6b and FIG. 6d. It should be understood that the structures shown in FIG. 7a to FIG. 7c are only examples; the structure of any other neural network accelerator can also be abstracted in this way. Furthermore, the TAM does not remain fixed; it can change according to the hardware structure to be abstracted.
It should be understood that, although single-layer and two-layer processor structures are described above, those skilled in the art can abstract processors with more layers.
The following takes the convolution (conv) operation (Conv = input data * weight data) as an example to illustrate the order in which data flows through the TAM. In the single-layer processor structure, the operation proceeds as follows:
A1) First, the control circuit reads program instructions from the external storage DRAM;
B1) Next, input data is read from the external storage DRAM into the PNM, and weight data is read into the PWM;
C1) The PFU reads the input data and weight data from the internal memories PNM and PWM, completes one vector operation, and writes the intermediate result back to the PNM;
D1) The intermediate result of this operation on the PNM is written back to the external storage DRAM.
Throughout the operation, the input neuron data flows along the path DRAM->PNM->PFU->PNM->DRAM.
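As a purely illustrative sketch of this single-layer flow, the load/compute/store sequence could be written in C++ as follows. The function name, the use of plain vectors for the DRAM/PNM/PWM buffers, and the elementwise multiplication standing in for the convolution are assumptions made for the example, and the input and weight buffers are assumed to have equal size.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Single-layer flow for one tile: DRAM -> PNM -> PFU -> PNM -> DRAM.
void run_single_layer_tile(const Buf& dram_in, const Buf& dram_w, Buf& dram_out) {
    Buf pnm = dram_in;                               // B1) load input data into the PNM
    Buf pwm = dram_w;                                // B1) load weight data into the PWM
    for (std::size_t i = 0; i < pnm.size(); ++i)     // C1) one PFU vector operation,
        pnm[i] = pnm[i] * pwm[i];                    //     intermediate result kept in the PNM
    dram_out = pnm;                                  // D1) write the result back to the DRAM
}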
In the two-layer structure, the operation proceeds as follows (an illustrative sketch is given after the list):
A2) The control circuit reads program instructions from the external storage DRAM;
B2) Input data and weight data are read from the external storage DRAM into the PSM;
C2) According to the task division among the multiple cores, the data in the PSM is read into the PNM and the PWM;
D2) The PFU reads the input data and weight data from the internal memories PNM and PWM, completes one vector operation, and writes the intermediate result back to the PNM;
E2) The intermediate result of this operation on the PNM is written back to the shared storage PSM;
F2) After the operation results of all the processing cores have been written back to the PSM, the results of all the operations are written back to the external storage DRAM.
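A corresponding C++ sketch of the two-layer flow is given below; the number of cores, the even slicing of the data among the cores (the data size is assumed to be divisible by the number of cores), and the elementwise multiplication standing in for the convolution are again illustrative assumptions.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Two-layer flow: DRAM -> PSM -> (PNM/PWM per core) -> PFU -> PNM -> PSM -> DRAM.
void run_two_layer(const Buf& dram_in, const Buf& dram_w, Buf& dram_out, std::size_t num_cores) {
    Buf psm_in = dram_in, psm_w = dram_w;            // B2) DRAM -> PSM
    Buf psm_out(psm_in.size(), 0.0f);
    std::size_t slice = psm_in.size() / num_cores;   // C2) task division among the cores
    for (std::size_t core = 0; core < num_cores; ++core) {
        std::size_t base = core * slice;
        Buf pnm(psm_in.begin() + base, psm_in.begin() + base + slice);  // C2) PSM -> PNM
        Buf pwm(psm_w.begin() + base, psm_w.begin() + base + slice);    // C2) PSM -> PWM
        for (std::size_t i = 0; i < slice; ++i)      // D2) one PFU vector operation per core
            pnm[i] *= pwm[i];
        for (std::size_t i = 0; i < slice; ++i)      // E2) PNM -> PSM
            psm_out[base + i] = pnm[i];
    }
    dram_out = psm_out;                              // F2) PSM -> DRAM after all cores finish
}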
The difference between GIR and TIR was introduced above. The process of converting Graph IR into TIR is described below in combination with the hardware features.
In general, converting Graph IR into TIR requires data splitting and data scheduling, for example splitting the data so that it can be stored in the corresponding memories. In the embodiments of the present application, the process of converting Graph IR into TIR may include: splitting the multidimensional data of the second intermediate representation into at least one piece of sub-multidimensional data; determining the storage space of each piece of sub-multidimensional data according to its data category (for example, the data categories include input data and weights) and according to the operations in which it needs to participate; and generating the space allocation instructions, memory access instructions, operation instructions, and so on related to the sub-multidimensional data. The space allocation instructions, memory access instructions, and operation instructions related to the sub-multidimensional data are instructions defined by the first intermediate representation.
FIG. 8a shows an exemplary code diagram of the first intermediate representation TIR.
As shown in FIG. 8a, it specifies the data formats of the input data x, the weight data Weight, and the temporary data Temp:
Produce(){
Tensor(fp32)<NCHW>(1,3,224,224)x: the data type of x is float32, the format is NCHW, and the size is (1,3,224,224)
Tensor(fp32)<NCHW>(64,3,3,3)y: the data type of y is float32, the format is NCHW, and the size is (64,3,3,3)
Tensor(fp32)<NCHW>(1,64,224,224)Temp: the data type of Temp is float32, the format is NCHW, and the size is (1,64,224,224).
The TIR also gives space allocation instructions describing how storage space is allocated for the input data x, the weight data Weight, the temporary data Temp, the bias data y, the result data Result, and so on, for example:
allocate.pnm x: store the data x in the pnm memory
allocate.pwm Weight: store the data Weight in the pwm memory
allocate.pnm Temp: store the data Temp in the pnm memory
allocate.pnm Y: store the data Y in the pnm memory
allocate.pnm Result: store the data Result in the pnm memory
In addition, the TIR gives memory access instructions that load data from off-chip memory (for example GDRAM) into on-chip memory, such as the instruction load x.gdram to x.pnm; the TIR also gives operation instructions that describe the operations, such as the instruction
Conv(x.pnm, Weight.pwm, Temp.pnm).
FIG. 8b shows a schematic diagram of converting the second intermediate representation into the first intermediate representation according to an embodiment of the present disclosure.
As shown in FIG. 8b, converting the second intermediate representation into the first intermediate representation may include: in operation S810, splitting the multidimensional data of the second intermediate representation into multiple pieces of sub-multidimensional data according to the size of the storage building block BBM (Building Block of Memory), and generating space allocation instructions and memory access instructions, so as to indicate that corresponding storage space is requested in the shared storage circuit PSM according to the space allocation instructions and that the multiple pieces of sub-multidimensional data are loaded from the off-chip memory into the shared storage circuit PSM in multiple passes according to the memory access instructions; in operation S820, splitting the sub-multidimensional data in the PSM according to the computation building block BBC (Building Block of Computation), and generating space allocation instructions (for example allocate.pnm x; allocate.pwm Weight in FIG. 8a) and memory access instructions (for example load x.gdram to x.pnm; load Weight.gdram to Weight.pwm in FIG. 8a), so as to indicate that corresponding storage space is requested in the corresponding memories (the parallel neuron memory PNM and/or the parallel weight memory PWM) according to the space allocation instructions, and that, according to the corresponding memory access instructions, the input data in the split sub-multidimensional data is loaded into the parallel neuron memory PNM (allocate.pnm x) and the weight data is loaded into the parallel weight memory PWM (allocate.pwm Weight); in operation S830, obtaining an intermediate result after operating on the input data and the weight data according to the operation instruction (for example Conv(x.pnm, Weight.pwm, Temp.pnm) in FIG. 8a), and storing the intermediate result in the PSM; and in operation S840, storing the intermediate result in the off-chip memory according to the memory access instruction (for example store Result.pnm to Result.gdram in FIG. 8a).
An exemplary description of converting the second intermediate representation into the first intermediate representation when a PSM is present has been given above. In a single-level structure there is no PSM; in that case, the input data is loaded directly into the parallel neuron memory PNM, and the weight data is loaded directly into the parallel weight memory PWM.
FIG. 8b shows the flowchart of converting GIR into TIR. In the above operations, two building blocks can be defined, namely the storage building block and the computation building block. The storage building block BBM can be the smallest granularity of a data copy, and its size is determined by the shared storage circuit PSM; the computation building block can be the smallest granularity of an operation, and its size is constrained both by the PNM and PWM and by the computation instruction of the current operation. First, the compiler can split from the off-chip memory (for example DRAM) down to the PSM dimension according to the original data and the size of the storage building block: for the original input data tensor A on the DRAM, only one BBM is loaded onto the PSM at a time, and the number of loads is tensor A / BBM.
After that, the compiler can perform the PSM→PWM and PSM→PNM splitting according to the storage building block BBM and the computation building block. After the data has been loaded onto the PNM and the PWM block by block, the operation can be completed through the operation instructions defined by the TIR. Next, the result of the PFU computation is stored into the PSM according to the current split position, and then the PSM contents are stored into the DRAM according to the current split position.
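The two-level tiling described above can be sketched as a pair of nested loops. In the C++ sketch below, the 1-D addressing, the block sizes bbm and bbc, and the elementwise multiplication standing in for the TIR operation instruction are assumptions introduced only for illustration.
#include <algorithm>
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Two-level split: DRAM -> PSM in BBM-sized pieces, then PSM -> PNM/PWM in BBC-sized pieces.
void tiled_op(const Buf& dram_in, const Buf& dram_w, Buf& dram_out,
              std::size_t bbm, std::size_t bbc) {
    dram_out.assign(dram_in.size(), 0.0f);
    for (std::size_t m = 0; m < dram_in.size(); m += bbm) {            // S810: DRAM -> PSM
        std::size_t m_end = std::min(m + bbm, dram_in.size());
        Buf psm_in(dram_in.begin() + m, dram_in.begin() + m_end);
        Buf psm_w(dram_w.begin() + m, dram_w.begin() + m_end);
        Buf psm_out(psm_in.size(), 0.0f);
        for (std::size_t c = 0; c < psm_in.size(); c += bbc) {         // S820: PSM -> PNM/PWM
            std::size_t c_end = std::min(c + bbc, psm_in.size());
            for (std::size_t i = c; i < c_end; ++i)                    // S830: PFU operation,
                psm_out[i] = psm_in[i] * psm_w[i];                     //       result kept in the PSM
        }
        std::copy(psm_out.begin(), psm_out.end(), dram_out.begin() + m);  // S840: PSM -> DRAM
    }
}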
It should be clear that FIG. 8a and FIG. 8b are only used to illustrate the conversion process of the intermediate representation; the instructions contained in the above intermediate representation are only a form of intermediate representation, not hardware instructions that a processor can execute. The operation process shown in the flow of FIG. 8b is only used to illustrate the role of the intermediate representation and is not intended to limit a specific operation process. For example, for a processor to implement the convolution operation shown in FIG. 8a, the intermediate representation shown in FIG. 8a needs to be converted into specific machine instructions, and the processor can implement the above operation process according to those machine instructions.
According to an embodiment of the present disclosure, the compiler may further optimize the first intermediate representation, so as to generate the target program according to the optimized first intermediate representation. Various embodiments of optimizing the first intermediate representation are described in detail below.
Another advantage of TIR is that the provided architecture enables potential optimization operations to be performed with tensor semantics. Traditional graph-based intermediate representations (for example Relay) only allow graph-level optimizations. Optimization schemes according to embodiments of the present disclosure are given below.
According to an embodiment of the present disclosure, the compiler optimizing the first intermediate representation may include: converting the first dimension order of the multidimensional data into a second dimension order, so as to adapt to the corresponding neural network accelerator.
The above optimization is mainly aimed at GPU operations. Since tensor data in TensorFlow uses the NHWC format by default, while using NCHW on the GPU is more efficient, two conversion nodes can be used during optimization, namely an NHWC-to-NCHW conversion node and an NCHW-to-NHWC conversion node; the conversions performed by consecutive NHWC-to-NCHW and NCHW-to-NHWC conversion nodes between two consecutive GPU compute nodes can cancel each other out.
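A minimal sketch of this cancellation pass is given below; representing the layout conversions as string-tagged nodes in a linear node sequence is a deliberate simplification of the actual intermediate representation.
#include <string>
#include <vector>

// Remove back-to-back inverse layout conversions (NCHW->NHWC followed by NHWC->NCHW,
// or vice versa); the string encoding of the nodes is illustrative only.
std::vector<std::string> cancel_layout_casts(const std::vector<std::string>& nodes) {
    std::vector<std::string> out;
    for (const std::string& n : nodes) {
        bool inverse_pair =
            !out.empty() &&
            ((out.back() == "nchw_to_nhwc" && n == "nhwc_to_nchw") ||
             (out.back() == "nhwc_to_nchw" && n == "nchw_to_nhwc"));
        if (inverse_pair) out.pop_back();   // the two conversions cancel each other out
        else out.push_back(n);
    }
    return out;
}
For example, applying this pass to the sequence {"conv", "nchw_to_nhwc", "nhwc_to_nchw", "conv"} would leave only the two compute nodes.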
According to an embodiment of the present disclosure, the compiler optimizing the first intermediate representation may further include: fusing a first operator and a second operator. Operator fusion, which may also be called layer fusion, fuses multiple layers of the neural network together when compiling and generating instructions, which reduces the number of accesses to the off-chip memory and thereby improves the data throughput.
For example, according to an embodiment of the present disclosure, after the first operator and the second operator are fused, the computation of the first operator and the second operator proceeds as follows: the intermediate result between the first operator and the second operator is stored in the on-chip memory, so that the second operator can read the data from the on-chip memory.
FIG. 9a and FIG. 9b are schematic diagrams of a traditional neural network operation and of the neural network after operator fusion.
As shown in FIG. 9a, the data access process of a traditional neural network operation is as follows:
1. Read the input data of the entire computation graph (that is, the input of the first operator) from the DRAM into the PNM, and read the weight data of the first operator into the PWM;
2. The PFU reads the input data and weight data from the PNM and the PWM to complete the operation, and writes the result of the first operator back to the PNM;
3. Write the result of the first operator from the PNM back to the DRAM, as the input of the second operator;
4. Read the input data of the second operator from the DRAM into the PNM, and read the weight data of the second operator into the PWM;
5. The PFU reads data from the PNM and the PWM to complete the operation, and writes the result of the second operator back to the PNM;
6. Write the result of the second operator back to the DRAM as the output of the entire computation graph.
In a traditional neural network, the intermediate result between two operators needs to be stored in the off-chip memory, and when the next operator performs its operation, this intermediate result needs to be read from the off-chip memory. As a result, every operation needs to read data from the off-chip memory, which obviously reduces the operation speed of the entire neural network, and the data access operations on the off-chip memory easily become the bottleneck of the neural network's operation speed.
In the present invention, as shown in FIG. 9b, the intermediate result can instead be stored in on-chip memory (for example SRAM), thereby improving the data access speed. According to an embodiment of the present disclosure, the first processor fusing the first operator and the second operator includes the following (an illustrative sketch is given after the list):
1. Write the first input data of the first operator into the PNM;
2. Write the first weight data of the first operator and the second weight data of the second operator into the PWM;
3. Compute a first operation result from the first input data and the first weight data, and write the first operation result into the PNM;
4. Compute a second operation result from the first operation result and the second weight data, and write the second operation result into the PNM.
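The following sketch illustrates the fused execution of the two operators; plain vectors stand in for the PNM/PWM buffers, the two operators are replaced by elementwise multiplications, and the weight vectors are assumed to have the same length as the input. The point of the sketch is only that the intermediate result never leaves on-chip memory.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Fused execution of two operators: the intermediate result stays in the PNM,
// so the second operator never reads from off-chip memory.
Buf fused_two_ops(const Buf& input, const Buf& w1, const Buf& w2) {
    Buf pnm = input;                                 // 1. input of the first operator -> PNM
    Buf pwm = w1;                                    // 2. weights of both operators -> PWM
    pwm.insert(pwm.end(), w2.begin(), w2.end());
    for (std::size_t i = 0; i < pnm.size(); ++i)     // 3. first operator, result kept in the PNM
        pnm[i] = pnm[i] * pwm[i];
    for (std::size_t i = 0; i < pnm.size(); ++i)     // 4. second operator reads the PNM directly
        pnm[i] = pnm[i] * pwm[input.size() + i];
    return pnm;                                      // only the final result goes off chip
}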
A method of operator fusion has been described above. It should be understood that the intermediate result may also be larger than the capacity of the PNM, so that not all of the intermediate results can be stored in the PNM at once.
According to the first embodiment of the present disclosure, if the first operation result of the first input data and the first weight data of the first of the two operators is larger than the capacity of the neuron memory PNM, the first weight data is split to form multiple pieces of first sub-weight data, such that the first sub-operation result of the first input data and one piece of first sub-weight data is smaller than the capacity of the PNM. The first input data is then operated on with the multiple pieces of first sub-weight data in turn, and each time a first sub-operation result is obtained, that first sub-operation result is stored in the PNM, so that the second operator can operate on it to obtain a second sub-operation result.
FIG. 10 exemplarily shows a schematic diagram of performing operator fusion according to an embodiment of the present disclosure.
As shown in FIG. 10, the input data is shown at the far left. Suppose that the intermediate result in FIG. 10 is larger than the capacity of the PNM, so that the intermediate result cannot be stored in the on-chip memory. In this case, the weight data can be split, for example into weight data 1 and weight data 2. Weight data 1 is represented by the light squares, and weight data 2 is represented by the dark squares.
Thus, the input data is first convolved with weight data 1 (the first operator), as shown by route 1 in FIG. 10. The intermediate data generated by the convolution operation (shown as the light squares in the intermediate result) is stored in the on-chip memory. Then the second operator reads the stored intermediate data from the on-chip memory and performs its operation (as shown by route 2), and the resulting output is stored in the off-chip memory.
Next, the input data is convolved with weight data 2, as shown by route 3 in FIG. 10. The intermediate data generated by the convolution operation (shown as the dark squares in the intermediate result) is stored in the on-chip memory. Then the second operator reads the stored intermediate data from the on-chip memory and performs its operation (as shown by route 4), and the resulting output is stored in the off-chip memory.
As can be seen from FIG. 10 and the above description, in this embodiment the intermediate result is not generated all at once, but is formed block by block by splitting the weight data. Each generated intermediate block can be stored in the on-chip memory, and the second operator does not need to read data from the off-chip memory, so the number of accesses to the off-chip memory is reduced and the operation efficiency is improved.
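The blocked form of this fusion can be sketched as follows; the num_blocks-way split, the elementwise stand-ins for the two operators, and the assumption that the data size is divisible by the number of blocks are all introduced only for illustration.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Fusion when the full intermediate result exceeds the PNM capacity: the weights are
// split into blocks, and each block's intermediate result is consumed by the second
// operator while it is still resident in on-chip memory.
Buf blocked_fusion(const Buf& input, const Buf& weight1, const Buf& weight2,
                   std::size_t num_blocks) {
    Buf output(input.size(), 0.0f);
    std::size_t block = input.size() / num_blocks;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::size_t base = b * block;
        Buf pnm(block);
        for (std::size_t i = 0; i < block; ++i)            // first operator on one weight block
            pnm[i] = input[base + i] * weight1[base + i];  // (routes 1 and 3 in FIG. 10)
        for (std::size_t i = 0; i < block; ++i)            // second operator reads the PNM directly
            output[base + i] = pnm[i] * weight2[base + i]; // (routes 2 and 4 in FIG. 10)
    }
    return output;
}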
FIG. 11 shows how data is stored when there are multiple PSMs according to an embodiment of the present disclosure, and FIG. 12a to FIG. 12d are schematic diagrams showing how data is rotated among multiple PSMs during data storage.
According to an embodiment of the present disclosure, when there are multiple PSMs, the first processor is further configured to: in operation S1110, store multiple sets of weight data in the multiple PSMs, rotating them multiple times; in operation S1120, operate on the weight data in the multiple PSMs after each rotation; and in operation S1130, after the weight data in all the PSMs has been fully rotated, read new weight data into the multiple PSMs.
The memories of the PFU (for example the PNM and the PWM) are relatively close to the computation units; therefore, these memories need to be used carefully to avoid stalling the execution pipeline. More specifically, the programmer needs to calculate the size of the on-chip buffer required for the computation. If the size required for one computation exceeds the size of the PFU memory, the compilation process will stop with an error message. A method for improving the utilization of the PWM is given below: by keeping the weight data in the PWM, multiple convolution operations can be performed on different parts of the input data. Compared with the traditional method, which needs to access the off-chip memory continuously, the efficiency of the method of the present disclosure is improved by a factor of 1.6.
Specifically, when a processor or a chip has multiple processing clusters, there may be multiple PSM memories, and each PSM memory can read corresponding data, for example weight data, from the off-chip memory (for example DRAM). FIG. 12a to FIG. 12d show four PSMs, namely PSM0, PSM1, PSM2, and PSM3; these PSMs can store four sets of weight data, denoted weight data A, weight data B, weight data C, and weight data D.
First, as shown in FIG. 12a, weight data A is stored in PSM0, weight data B in PSM1, weight data C in PSM2, and weight data D in PSM3. After the PSMs have stored the above four sets of weight data, these weight data can be operated on with the input data.
After the four sets of weight data have been operated on, they can be stored in the PSMs in rotation. In FIG. 12b, after one rotation, weight data A is transferred from PSM0 to PSM1, weight data B from PSM1 to PSM2, weight data C from PSM2 to PSM3, and weight data D from PSM3 to PSM0.
After the above weight data D, weight data A, weight data B, and weight data C have been operated on, the next rotation is performed.
In FIG. 12c, weight data A is transferred from PSM1 to PSM2, weight data B from PSM2 to PSM3, weight data C from PSM3 to PSM0, and weight data D from PSM0 to PSM1.
In FIG. 12d, weight data A is transferred from PSM2 to PSM3, weight data B from PSM3 to PSM0, weight data C from PSM0 to PSM1, and weight data D from PSM1 to PSM2.
During these three iterations (that is, rotations), there is no need to communicate with the off-chip memory. After three iterations, that is, three rotations, new weight data needs to be loaded; in this case, the new weight data can be read from the off-chip memory, for example the DRAM.
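A minimal sketch of this rotation schedule is given below; it assumes that each weight set fits in one PSM, uses a simple cyclic shift of the buffers, and omits the actual computation step, all of which are simplifications made for illustration.
#include <algorithm>
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Weight rotation across the PSMs: after the initial load, (num_psm - 1) rotations
// reuse the resident weight sets with no off-chip traffic.
void rotate_weights(std::vector<Buf>& psm /* one buffer per PSM */) {
    std::size_t num_psm = psm.size();
    for (std::size_t step = 0; step + 1 < num_psm; ++step) {
        // the computation on every psm[k] would run here (operation S1120)
        std::rotate(psm.begin(), psm.begin() + (num_psm - 1), psm.end());
        // the rotate moves the last buffer to the front, i.e. each weight set advances
        // to the next PSM, matching the rotation shown in FIG. 12b to FIG. 12d
    }
    // after all rotations, new weight data would be loaded from the DRAM (operation S1130)
}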
The user can implement the method shown in FIG. 11 and FIG. 12a to FIG. 12d by editing the TAL. With this intermediate access latency, the cluster memory can partially bridge the gap between accesses to the PFU memory (for example, 10 clock cycles) and accesses to the off-chip memory DRAM (for example, 300 clock cycles).
According to an embodiment of the present disclosure, processing core synchronization, processing cluster synchronization, and/or chip synchronization can be performed at the TAL; and/or parallel tasks can be mapped to parallel processing clusters in the virtual processor.
Likewise, the control logic of this embodiment can also be realized by the user editing the TAL.
The main purpose of exposing the control logic at the TAL level is to provide functional correctness and execution efficiency, the main features of which are the synchronization and parallelization of a large number of computation units. There are three types of synchronization, namely processing core synchronization, processing cluster synchronization, and/or chip synchronization. Processing core synchronization guarantees the correctness of the parallel execution of the pipelines of different functional units (for example scalar, vector, and matrix units). A processing cluster includes multiple processing cores, and processing cluster synchronization keeps all processing cores within a given cluster synchronized. Chip synchronization ensures that all clusters continue execution only after every cluster has reached the synchronization point. It should be noted that processing core synchronization can be hidden from the user to simplify the programmer's burden. One potential optimization is software pipelining, which can be used to hide the latency of memory accesses.
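As a rough illustration of the loop restructuring used in software pipelining (double buffering), the sketch below loads the next tile inside the iteration that computes the current one; on real hardware the load and the computation would be issued to different units so that they overlap in time, whereas this scalar sketch only shows the ordering. The tile representation and the trivial compute step are assumptions made for the example.
#include <cstddef>
#include <vector>
using Buf = std::vector<float>;

// Software-pipelined (double-buffered) tile loop: the load of tile t+1 is issued in the
// same iteration that computes tile t, so memory latency can be hidden behind compute.
void pipelined_tiles(const std::vector<Buf>& tiles, Buf& out) {
    if (tiles.empty()) return;
    Buf current = tiles[0];                              // prologue: load the first tile
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        Buf next;
        if (t + 1 < tiles.size()) next = tiles[t + 1];   // load of tile t+1 ...
        for (float v : current) out.push_back(v * 2.0f); // ... alongside compute on tile t
        current = next;                                  // the prefetched tile becomes current
    }
}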
The mapping of parallel tasks to parallel processing clusters in the virtual processor is described below with reference to FIG. 13a and FIG. 13b.
As shown in FIG. 13a, suppose that the kernel on the host has a task with TaskDim.x = 2 and TaskDim.y = 4, which means that 2 processing clusters are required and each processing cluster requires 4 processing cores. Such a task can be mapped to 2 processing clusters in the TAM, namely processing cluster 0 and processing cluster 1, each having 4 processor cores, namely processing core 0, processing core 1, processing core 2, and processing core 3.
As described above, the TAM abstracts the common features of multiple neural network accelerators; it extracts the key and common features of tensor processing in various ML architectures. Therefore, when a task is mapped onto the TAM, the TAM can be further mapped onto a specific hardware accelerator.
Thus, the above parallel tasks can be mapped to specific hardware accelerators according to the underlying hardware structure.
As shown in FIG. 13b, GPU-TC and Cambricon-CC are taken as examples. In GPU-TC, two SMs can be used. Although each SM may include, for example, 8 tensor processing cores, in this task each streaming multiprocessor SM only needs to use 4 tensor processing cores.
In Cambricon-CC, two processing clusters can be used, namely processing cluster 0 and processing cluster 1, each with 4 processing cores; therefore, in this task, each processing cluster can use 4 processing cores.
When a task requires more processing clusters or processing cores than are available, it can be run in a time-sharing manner, that is, the task can be executed in multiple passes.
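The mapping and time-sharing decision can be sketched as a simple calculation; the TaskDim/Hardware descriptors and the function below are invented for illustration and do not correspond to any real runtime interface.
#include <cstdio>

// Illustrative descriptors for mapping TaskDim.{x, y} onto processing clusters and cores.
struct TaskDim { int x; int y; };             // x: clusters requested, y: cores per cluster
struct Hardware { int clusters; int cores_per_cluster; };

// Number of time-shared passes needed when the request exceeds the available hardware.
int passes_needed(TaskDim t, Hardware hw) {
    int cluster_passes = (t.x + hw.clusters - 1) / hw.clusters;
    int core_passes = (t.y + hw.cores_per_cluster - 1) / hw.cores_per_cluster;
    return cluster_passes * core_passes;
}

int main() {
    TaskDim task{2, 4};                       // TaskDim.x = 2, TaskDim.y = 4 (FIG. 13a)
    Hardware cambricon_cc{2, 4};              // two processing clusters of four cores each
    std::printf("passes: %d\n", passes_needed(task, cambricon_cc));   // prints: passes: 1
    return 0;
}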
The above control logic can also be implemented in the TAL, so that the user can customize it according to actual needs.
After the above conversion of the intermediate representation, optimization, storage allocation, and abstraction of the accelerator hardware, the generated code can be converted into a target program adapted to the specific hardware.
For example, when the underlying hardware is an MLU, the target program can be in the Bang C language to adapt to the MLU hardware; when the underlying hardware is GPU-TC, the target program can be in the CUDA C language. It can be understood that when the underlying hardware is another accelerator, the target program can be converted into machine instructions adapted to that accelerator.
In the above technical solution of the present disclosure, since the TAM can be instantiated for different specific platforms (for example MLU, TPU, GPU-TC), it is applicable to various accelerators, which significantly improves the portability of the system and facilitates porting onto the hardware platforms of various accelerators.
To verify the technical solution of TIC, three types of machine learning computers were used, namely a GPU with tensor processing cores (GPU-TC), an MLU, and a TPU. The neural network algorithms used for evaluation come from different application scenarios and include ResNet-50 and VGG16; ResNet and VGG are used not only for image classification but also as backbone networks for general feature extraction.
This application is compared against three baselines. The first baseline is the TVM stack, which directly supports GPU-TC, MLU, and TPU by rewriting tensor primitives. The second baseline is Glow IR, which includes two layers of IR, a high-level graph IR (mainly used for graph optimization) and a low-level instruction IR (mainly used for memory-related optimization). The third baseline is the TensorFlow framework. The above baselines were originally designed mainly for CPUs and GPUs (without tensor processing cores) and were modified according to the technical solution of this application to support GPU-TC, MLU, and TPU.
The experimental and comparative results mainly cover three aspects: performance, efficiency, and portability. They are described in detail below.
1. Performance
GPU-TC: FIG. 14 shows the performance of the TIC technique and the other baseline techniques on GPU-TC, where the execution latency is normalized to the latency of TensorFlow. Compared with the TensorFlow programming framework, the average performance improvement of TIC is approximately 201%. The main reason is that unnecessary framework overhead is avoided and multiple optimizations are performed at the TAL. Compared with Glow, the average performance improvement is approximately 34.8%, because Glow's backend is implemented directly in CUDA C rather than through highly optimized libraries. Compared with TVM, the average performance improvement is approximately 13.7%. The benefit comes mainly from two optimization passes, namely the data dimension order optimization performed with the help of TIR and the operator fusion optimization.
MLU: FIG. 15 shows the performance of the TIC technique and the other baseline techniques on the MLU. Compared with the TensorFlow programming framework, the average performance of TIC is approximately 96.9% of the performance of TensorFlow. The main reason is that TensorFlow for the MLU runs on highly optimized libraries, while further optimizations can still be applied to the TIC solution. For example, on ResNet50, the performance of the disclosed solution is approximately 41.4% higher than that of TensorFlow, because several customized optimizations are performed on top of TIC. Compared with Glow and TVM, the performance improvements are approximately 23.5% and 20.7% respectively, which demonstrates the efficiency of TIC as a compilation architecture.
TPU-Lite: FIG. 16 shows the performance of the TIC technique and the other baseline techniques on the TPU; the original Glow cannot run on TPU-Lite. Since the TPU primitives considered are relatively coarse-grained, only very limited optimizations can be performed on the TIC solution. Therefore, the performance of the different implementations is relatively close.
2. Efficiency
In the TIC solution, efficiency can be evaluated from different perspectives. From the perspective of using the programming framework to build ML applications, the efficiency of the TIC solution is the same as that of the other baselines, since the programming interface is preserved. From the perspective of using TIR and TAL to build new operations, the efficiency is significantly improved, because the semantics of tensor data are preserved all the way from the graph nodes down to the hardware of the TAM.
FIG. 17 compares the reduction in LoC when the TVM and TIC techniques are used to implement convolution operations on GPU-TC and the MLU. It can be clearly seen that the LoC drops by 43% and 38% on GPU-TC and the MLU respectively. From the perspective of using the TAL to directly build ML applications, developing applications in standard C/C++ already exhibits high efficiency. One clear advantage is that many existing applications written in C/C++ can be converted to TAL directly, without tensor-related optimization.
3. Portability
Portability is evaluated using a quantitative metric; this approach is described in Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S. Vetter, "NVIDIA tensor core programmability, performance & precision", IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 522–531, IEEE.
Table 1 below compares the portability achieved using TensorFlow, TVM, and the TIC of the present disclosure. Quantitatively, the TIC of the present disclosure improves on TensorFlow and TVM by 25% and 15.4%, respectively.
Architecture    Portability
TensorFlow      0.6219%
TVM             0.6743%
TIC             0.7784%
Table 1
In the present disclosure, a Tensor Intact Compiling (TIC) architecture is proposed for improving performance, efficiency, and portability. The idea behind TIC is to preserve tensor semantics throughout the compilation process, that is, from the upper-level programming interface, through the lower-level intermediate representations and the various languages, all the way down to the tensor-related instructions of the underlying hardware platform. The whole TIC architecture may preferably include three components, namely the tensor abstract machine module TAM, the tensor-aware language module TAL, and the tensor intermediate representation module TIR, which are mainly used to address portability, performance, and efficiency, respectively. Programmers can act on TIC through the programming framework or by using the TAL directly; the code is lowered into an optimized form on the TAM, and the TAL can further be compiled into binaries for different target platforms. Experimental results show that, on GPU-TC, TPU, and MLU, TIC outperforms the prior art in performance, portability, and efficiency.
Embodiments of the present disclosure also provide an electronic device, including: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
According to another embodiment of the present disclosure, there is also provided a computer-readable storage medium including computer-executable instructions which, when run by one or more processors, perform the method described above.
The above method and device can also be implemented as a compiling apparatus, and the compiling apparatus can form part of a combined processing apparatus.
FIG. 18 shows a combined processing apparatus 1800, which includes the above-mentioned compiling apparatus 1802, a general interconnection interface 1804, and another processing apparatus 1806. The compiling apparatus according to the present disclosure interacts with the other processing apparatus to jointly complete the operation specified by the user. FIG. 18 is a schematic diagram of the combined processing apparatus.
The compiling apparatus can be implemented in various ways, such as software or hardware, and can run on any one or more of general-purpose/special-purpose processors such as a CPU, a graphics processor GPU, or a neural network processor.
The other processing apparatus includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit CPU, a graphics processing unit GPU, and a neural network processor. The number of processors included in the other processing apparatus is not limited. The other processing apparatus serves as the interface between the machine learning computing apparatus and external data and control, including data transfer, and completes basic control such as starting and stopping of the machine learning computing apparatus; the other processing apparatus can also cooperate with the machine learning computing apparatus to complete computing tasks together.
The general interconnection interface is used to transfer data and control instructions between the compiling apparatus (including, for example, a machine learning computing apparatus) and the other processing apparatus. The compiling apparatus obtains the required input data from the other processing apparatus and writes it into the on-chip storage of the compiling apparatus; it can obtain control instructions from the other processing apparatus and write them into the on-chip control cache of the compiling apparatus; it can also read the data in the storage module of the compiling apparatus and transfer it to the other processing apparatus.
Optionally, the structure may further include a storage apparatus 1808, which is connected to the compiling apparatus and the other processing apparatus respectively. The storage apparatus is used to store data of the compiling apparatus and the other processing apparatus, and is particularly suitable for data that cannot be held entirely in the internal storage of the compiling apparatus or the other processing apparatus.
The combined processing apparatus can be used as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the general interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
In some embodiments, the present disclosure also discloses a chip, which includes the above compiling apparatus or combined processing apparatus.
In some embodiments, the present disclosure also discloses a board card, which includes the above chip. Referring to FIG. 19, an exemplary board card is provided; in addition to the above chip 1902, the board card may also include other supporting components, including but not limited to: a storage device 1904, an interface apparatus 1906, and a control device 1908.
The storage device is connected to the chip in the chip package structure through a bus and is used to store data. The storage device may include multiple groups of storage units 1910. Each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units. Each group of storage units may include multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface apparatus is electrically connected to the chip in the chip package structure. The interface apparatus is used to implement data transmission between the chip and an external device 1912 (for example, a server or a computer). For example, in one embodiment, the interface apparatus may be a standard PCIE interface; the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement the data transfer. In another embodiment, the interface apparatus may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the computation result of the chip is transmitted back to the external device (for example, the server) by the interface apparatus.
The control device is electrically connected to the chip. The control device is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, it can drive multiple loads; therefore, the chip can be in different working states such as multiple-load and light-load. Through the control device, the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip can be regulated.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described sequence of actions, because according to the present disclosure some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, when the technical solution of the present disclosure is embodied in the form of a software product, the computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present disclosure have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present disclosure, and the descriptions of the above embodiments are only used to help understand the method of the present disclosure and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.
The technical solutions of the present disclosure may be better understood through the following clauses:
条款1.一种对多维数据进行处理的方法,包括: Clause 1. A method of processing multidimensional data, comprising:
接收第一中间表示,所述第一中间表示具有多维数据语义;receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics;
解析所述第一中间表示,并将所述第一中间表示转换为目标程序,其中,所述目标程序保留了多维数据语义。The first intermediate representation is parsed, and the first intermediate representation is converted into a target program, wherein the target program preserves multidimensional data semantics.
条款2.根据条款1所述的方法,其中,所述第一中间表示包括所述多维数据的操作信息、数据属性及维度顺序中的一种或多种。 Clause 2. The method of Clause 1, wherein the first intermediate representation includes one or more of operational information, data attributes, and dimensional order of the multidimensional data.
Clause 3. The method according to Clause 1 or 2, further comprising:
receiving an abstract language representation and compiling the abstract language representation into the first intermediate representation, wherein the abstract language representation is editable by a user.
Clause 4. The method according to Clause 3, wherein a second intermediate representation is received to form the abstract language representation, the second intermediate representation comprising an intermediate representation expressed as a graph.
Clause 5. The method according to Clause 1 or 2, wherein converting the first intermediate representation into the target program comprises:
converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and
converting the abstract language representation into the target program.
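Continuing the hypothetical TensorOp sketch above, the two-stage conversion of Clause 5 might be pictured as follows; the function names and the intrinsic-style output format are assumptions made purely for illustration, not the disclosed implementation.

```python
def lower_to_abstract_language(ops):
    """Emit user-editable abstract-language statements that still speak in whole tensors."""
    lines = []
    for op in ops:
        args = ", ".join(f"{t.dtype}{list(t.shape)}" for t in op.inputs)
        lines.append(f"{op.name}({args}) -> {op.output.dtype}{list(op.output.shape)}")
    return "\n".join(lines)

def lower_to_target_program(abstract_source, backend="accelerator"):
    """Translate abstract-language source into target code while keeping tensor primitives."""
    emitted = []
    for stmt in abstract_source.splitlines():
        # Each tensor statement maps onto one backend intrinsic instead of scalar loops.
        emitted.append(f"__{backend}_intrinsic__ {stmt};")
    return "\n".join(emitted)

target_program = lower_to_target_program(lower_to_abstract_language([conv]))
```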
Clause 6. The method according to any one of Clauses 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation into the first intermediate representation,
wherein the second intermediate representation comprises an intermediate representation expressed as a graph.
Clause 7. The method according to Clause 4 or 6, further comprising:
parsing a neural network model file, the neural network model file including operation nodes of the neural network and their topological connection relationships; and
obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
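A minimal, illustrative sketch of deriving a graph-form second intermediate representation from a model file is given below; the JSON layout and field names are assumed for the example and are not prescribed by the disclosure.

```python
import json

def parse_model_file(path):
    """Collect operation nodes and their topological connections into a simple graph IR."""
    with open(path) as f:
        model = json.load(f)                     # assumed layout: {"nodes": [{...}, ...]}
    graph_ir = {"nodes": {}, "edges": []}
    for node in model["nodes"]:
        graph_ir["nodes"][node["name"]] = {"op": node["op"], "attrs": node.get("attrs", {})}
        for src in node.get("inputs", []):       # topological connection relationships
            graph_ir["edges"].append((src, node["name"]))
    return graph_ir
```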
Clause 8. The method according to any one of Clauses 1-7, further comprising: optimizing the first intermediate representation, so that the target program is generated from the optimized first intermediate representation.
Clause 9. The method according to Clause 8, wherein optimizing the first intermediate representation comprises:
converting a first dimension order of the multidimensional data into a second dimension order to adapt to a corresponding neural network accelerator; and/or fusing a first operator and a second operator.
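Again continuing the hypothetical TensorOp sketch, the two optimizations of Clause 9, dimension-order conversion and operator fusion, could be pictured as the following passes; this is a sketch under assumed data structures, not the disclosed implementation.

```python
def convert_dim_order(t, new_order):
    """Permute a TensorType from its current dimension order (e.g. NCHW) to new_order (e.g. NHWC)."""
    perm = [t.dim_order.index(axis) for axis in new_order]
    return TensorType(t.dtype, tuple(t.shape[i] for i in perm), new_order)

def fuse_operators(op_a, op_b):
    """Fuse two adjacent operators (e.g. conv2d followed by relu) into a single IR node,
    so the intermediate tensor never round-trips through off-chip memory."""
    return TensorOp(name=f"{op_a.name}_{op_b.name}",
                    inputs=op_a.inputs,
                    output=op_b.output,
                    attrs={**op_a.attrs, **op_b.attrs})
```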
Clause 10. The method according to any one of Clauses 1-9, wherein the abstract language representation is formed based on a virtual processor, the virtual processor comprising:
an I/O interface circuit, a control circuit, and an operation component, the operation component including a first storage circuit and an operation circuit;
the I/O interface circuit being configured for input and output of the virtual processor;
the control circuit being configured to perform access operations through the I/O interface circuit;
the first storage circuit being configured to read at least input data and weight data through the I/O interface circuit; and
the operation circuit being configured to read the input data and the weight data from the first storage circuit to perform operations.
Clause 11. The method according to Clause 10, wherein:
the first storage circuit includes a parallel weight memory (PWM) for storing weight data and a parallel neuron memory (PNM) for storing input data;
the operation circuit includes a parallel functional unit (PFU) for operating on non-scalar data; and
the I/O interface circuit is connected to the control circuit, the PWM, and the PNM; the control circuit is connected to the PWM, the PNM, and the PFU; and both the PWM and the PNM are connected to the PFU.
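For intuition only, a toy Python model of the virtual processor of Clauses 10 and 11 follows; the component and method names are assumptions, and the model ignores timing, parallel lanes, the control circuit's instruction stream, and the SFU.

```python
class VirtualProcessor:
    """Toy model: the PWM holds weights, the PNM holds input neurons, the PFU consumes both."""
    def __init__(self, pnm_words, pwm_words):
        self.pnm = [0.0] * pnm_words    # parallel neuron memory (input data)
        self.pwm = [0.0] * pwm_words    # parallel weight memory (weight data)

    def io_load(self, neurons, weights):
        """The control circuit drives the I/O interface to fill the on-chip memories."""
        self.pnm[:len(neurons)] = neurons
        self.pwm[:len(weights)] = weights

    def pfu_dot(self, n):
        """Parallel functional unit: one non-scalar operation over the first n elements."""
        return sum(x * w for x, w in zip(self.pnm[:n], self.pwm[:n]))

vp = VirtualProcessor(pnm_words=1024, pwm_words=1024)
vp.io_load(neurons=[1.0, 2.0, 3.0], weights=[0.5, 0.5, 0.5])
partial_sum = vp.pfu_dot(3)   # 1*0.5 + 2*0.5 + 3*0.5 = 3.0
```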
Clause 12. The method according to Clause 10 or 11, wherein the virtual processor further includes a scalar functional unit (SFU) connected to the control circuit and the I/O interface circuit and configured to operate on scalar data.
Clause 13. The method according to any one of Clauses 10-12, wherein the virtual processor further includes a shared memory circuit (PSM) and the number of operation components is plural;
the PSM being configured to read input data and weight data through the I/O interface circuit; and
the plurality of operation components being connected to the PSM in parallel and configured to read the input data and the weight data from the shared memory circuit to perform operations.
Clause 14. The method according to Clause 6, wherein converting the second intermediate representation into the first intermediate representation comprises:
splitting the multidimensional data of the second intermediate representation into a plurality of pieces of sub-multidimensional data according to the size of a storage basic block (BBM), and loading the pieces of sub-multidimensional data from an off-chip memory into the shared memory circuit (PSM) over multiple passes;
splitting the sub-multidimensional data in the PSM according to a compute basic block (BBC), loading the input data of the split sub-multidimensional data into the parallel neuron memory (PNM), and loading the weight data into the parallel weight memory (PWM);
obtaining an intermediate result after operating on the input data and the weight data, and storing the intermediate result in the PSM; and
storing the intermediate result in the off-chip memory.
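As an illustrative sketch of the two-level splitting in Clause 14, the loop below mimics only the data movement; the block sizes, helper names, and the use of a flat one-dimensional list are assumptions made for the example.

```python
def tiled_execute(data, bbm, bbc, compute):
    """Split data by the storage basic block (BBM) for the PSM level,
    then by the compute basic block (BBC) for the PNM/PWM level."""
    results = []
    for psm_start in range(0, len(data), bbm):           # off-chip memory -> PSM, one BBM tile per pass
        psm_tile = data[psm_start:psm_start + bbm]
        partials = []
        for pnm_start in range(0, len(psm_tile), bbc):   # PSM -> PNM/PWM, one BBC tile per pass
            pnm_tile = psm_tile[pnm_start:pnm_start + bbc]
            partials.append(compute(pnm_tile))           # operation at the PFU level
        results.append(sum(partials))                    # intermediate result held in the PSM
    return results                                       # finally written back to off-chip memory

out = tiled_execute(list(range(64)), bbm=16, bbc=4, compute=sum)   # [120, 376, 632, 888]
```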
Clause 15. An electronic device, comprising:
one or more processors; and
a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method according to any one of Clauses 1-14.
Clause 16. A computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method according to any one of Clauses 1-14.

Claims (16)

  1. A method of processing multidimensional data, comprising:
    receiving a first intermediate representation, the first intermediate representation having multidimensional data semantics; and
    parsing the first intermediate representation and converting the first intermediate representation into a target program, wherein the target program preserves the multidimensional data semantics.
  2. The method according to claim 1, wherein the first intermediate representation includes one or more of operation information, data attributes, and a dimension order of the multidimensional data.
  3. The method according to claim 1 or 2, further comprising:
    receiving an abstract language representation and compiling the abstract language representation into the first intermediate representation, wherein the abstract language representation is editable by a user.
  4. The method according to claim 3, wherein a second intermediate representation is received to form the abstract language representation, the second intermediate representation comprising an intermediate representation expressed as a graph.
  5. The method according to claim 1 or 2, wherein converting the first intermediate representation into the target program comprises:
    converting the first intermediate representation into an abstract language representation, wherein the abstract language representation contains multidimensional data semantics; and
    converting the abstract language representation into the target program.
  6. The method according to any one of claims 1-5, further comprising: receiving a second intermediate representation and converting the second intermediate representation into the first intermediate representation,
    wherein the second intermediate representation comprises an intermediate representation expressed as a graph.
  7. The method according to claim 4 or 6, further comprising:
    parsing a neural network model file, the neural network model file including operation nodes of the neural network and their topological connection relationships; and
    obtaining the second intermediate representation according to the operation nodes and the topological connection relationships.
  8. The method according to any one of claims 1-7, further comprising: optimizing the first intermediate representation, so that the target program is generated from the optimized first intermediate representation.
  9. The method according to claim 8, wherein optimizing the first intermediate representation comprises:
    converting a first dimension order of the multidimensional data into a second dimension order to adapt to a corresponding neural network accelerator; and/or fusing a first operator and a second operator.
  10. The method according to any one of claims 1-9, wherein the abstract language representation is formed based on a virtual processor, the virtual processor comprising:
    an I/O interface circuit, a control circuit, and an operation component, the operation component including a first storage circuit and an operation circuit;
    the I/O interface circuit being configured for input and output of the virtual processor;
    the control circuit being configured to perform access operations through the I/O interface circuit;
    the first storage circuit being configured to read at least input data and weight data through the I/O interface circuit; and
    the operation circuit being configured to read the input data and the weight data from the first storage circuit to perform operations.
  11. The method according to claim 10, wherein:
    the first storage circuit includes a parallel weight memory (PWM) for storing weight data and a parallel neuron memory (PNM) for storing input data;
    the operation circuit includes a parallel functional unit (PFU) for operating on non-scalar data; and
    the I/O interface circuit is connected to the control circuit, the PWM, and the PNM; the control circuit is connected to the PWM, the PNM, and the PFU; and both the PWM and the PNM are connected to the PFU.
  12. The method according to claim 10 or 11, wherein the virtual processor further includes a scalar functional unit (SFU) connected to the control circuit and the I/O interface circuit and configured to operate on scalar data.
  13. The method according to any one of claims 10-12, wherein the virtual processor further includes a shared memory circuit (PSM) and the number of operation components is plural;
    the PSM being configured to read input data and weight data through the I/O interface circuit; and
    the plurality of operation components being connected to the PSM in parallel and configured to read the input data and the weight data from the shared memory circuit to perform operations.
  14. The method according to claim 6, wherein converting the second intermediate representation into the first intermediate representation comprises:
    splitting the multidimensional data of the second intermediate representation into a plurality of pieces of sub-multidimensional data according to the size of a storage basic block (BBM), and loading the pieces of sub-multidimensional data from an off-chip memory into the shared memory circuit (PSM) over multiple passes;
    splitting the sub-multidimensional data in the PSM according to a compute basic block (BBC), loading the input data of the split sub-multidimensional data into the parallel neuron memory (PNM), and loading the weight data into the parallel weight memory (PWM);
    obtaining an intermediate result after operating on the input data and the weight data, and storing the intermediate result in the PSM; and
    storing the intermediate result in the off-chip memory.
  15. An electronic device, comprising:
    one or more processors; and
    a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method according to any one of claims 1-14.
  16. A computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method according to any one of claims 1-14.
PCT/CN2021/123569 2020-10-16 2021-10-13 Device and method for processing multi-dimensional data, and computer program product WO2022078400A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011112090.5A CN114385867A (en) 2020-10-16 2020-10-16 Apparatus, method and computer program product for processing multidimensional data
CN202011112090.5 2020-10-16

Publications (1)

Publication Number Publication Date
WO2022078400A1 true WO2022078400A1 (en) 2022-04-21

Family

ID=81193962

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123569 WO2022078400A1 (en) 2020-10-16 2021-10-13 Device and method for processing multi-dimensional data, and computer program product

Country Status (2)

Country Link
CN (1) CN114385867A (en)
WO (1) WO2022078400A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132150A (en) * 2024-05-07 2024-06-04 中科寒武纪科技股份有限公司 Data access mode deducing method of calculation graph and related product

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024065860A1 (en) * 2022-10-01 2024-04-04 Intel Corporation Hardware support for n-dimensional matrix load and store instructions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190392296A1 (en) * 2019-06-28 2019-12-26 John Brady Hardware agnostic deep neural network compiler
CN110929850A (en) * 2019-11-26 2020-03-27 国家超级计算无锡中心 Deep learning operator automatic optimization system and method based on Shenwei processor
CN111104120A (en) * 2018-10-29 2020-05-05 赛灵思公司 Neural network compiling method and system and corresponding heterogeneous computing platform
WO2020093304A1 (en) * 2018-11-08 2020-05-14 北京比特大陆科技有限公司 Method, apparatus, and device for compiling neural network, storage medium, and program product

Also Published As

Publication number Publication date
CN114385867A (en) 2022-04-22

Similar Documents

Publication Publication Date Title
Kwon et al. Beyond the memory wall: A case for memory-centric hpc system for deep learning
Gu et al. Biscuit: A framework for near-data processing of big data workloads
CN110704360B (en) Graph calculation optimization method based on heterogeneous FPGA data flow
WO2021000970A1 (en) Deep learning algorithm compiling method, device, and related product.
Grossman et al. Hadoopcl: Mapreduce on distributed heterogeneous platforms through seamless integration of hadoop and opencl
WO2022078400A1 (en) Device and method for processing multi-dimensional data, and computer program product
CN112465108A (en) Neural network compiling method for storage and calculation integrated platform
US20230124520A1 (en) Task execution method and storage device
CN109918199B (en) GPU-based distributed graph processing system
WO2020083050A1 (en) Data stream processing method and related device
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
WO2021000971A1 (en) Method and device for generating operation data and related product
US20220188614A1 (en) Fractal calculating device and method, integrated circuit and board card
CN115033188B (en) Storage hardware acceleration module system based on ZNS solid state disk
TWI754310B (en) System and circuit of pure functional neural network accelerator
WO2022253075A1 (en) Compilation method and related apparatus
CN111831582A (en) Memory management device and method for intelligent processor and electronic equipment
Liu et al. Accelerating large-scale DEVS-based simulation on the cell processor
CN115840894A (en) Method for processing multidimensional tensor data and related product thereof
CN111831333A (en) Instruction decomposition method and device for intelligent processor and electronic equipment
Guo et al. Fused DSConv: Optimizing sparse CNN inference for execution on edge devices
CN112214443A (en) Secondary unloading device and method arranged in graphic processor
Kabrick et al. CODIR: towards an MLIR codelet model dialect
Shafiq et al. Automated flow for compressing convolution neural networks for efficient edge-computation with FPGA
Du et al. Breaking the interaction wall: A DLPU-centric deep learning computing system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21879443

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21879443

Country of ref document: EP

Kind code of ref document: A1