WO2023030507A1 - Compilation optimization method, apparatus, computer device, and storage medium - Google Patents

Compilation optimization method, apparatus, computer device, and storage medium

Info

Publication number
WO2023030507A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
syntax tree
abstract syntax
tensor data
data
Prior art date
Application number
PCT/CN2022/116879
Other languages
English (en)
French (fr)
Inventor
陈峋宇
曹博
Original Assignee
寒武纪(西安)集成电路有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111033876.2A external-priority patent/CN115840894A/zh
Priority claimed from CN202111033297.8A external-priority patent/CN115756722A/zh
Application filed by 寒武纪(西安)集成电路有限公司 filed Critical 寒武纪(西安)集成电路有限公司
Publication of WO2023030507A1 publication Critical patent/WO2023030507A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines

Definitions

  • the present disclosure relates to the field of computer technology, in particular to a compiling optimization method, device, computer equipment and storage medium.
  • the data processed in the tasks may be numerical values and tensors, where the tensors may include one-dimensional tensors (i.e., vectors), two-dimensional tensors (i.e., matrices), higher-dimensional tensors, etc.
  • multi-dimensional tensors are usually processed through multi-dimensional loops, but multi-dimensional loops introduce more control flow, increase the occupation of processing resources, and cannot make full use of processor and memory bandwidth, resulting in lower processing efficiency.
  • tensor data is widely used in various application scenarios, especially computing scenarios in the field of artificial intelligence including deep machine learning.
  • tensors can be used to express a variable or constant, which can have zero or more dimensions.
  • a zero-dimensional tensor is a scalar (such as a value), that is, a constant;
  • a one-dimensional tensor is a vector of combinations of values and directions;
  • a two-dimensional tensor is a combination of vectors (that is, a matrix);
  • a three-dimensional tensor is a data cube (that is, a combination of matrices); a four-dimensional tensor can be a combination of data cubes, and so on.
  • its dimension values can be variables, that is, the tensor data can be variable; its dimension settings and the data "placement" based on those settings have an important impact on how the data is stored and read, and thus significantly affect the computing efficiency and the overall performance of the computing platform. In view of this, how to perform dimension processing on multi-dimensional tensor data to achieve efficient data reading has become a technical problem that urgently needs to be solved.
  • a compilation optimization method is provided, including: acquiring dimension information of tensor data to be processed, wherein the dimension information includes a dimension value of at least one dimension of the tensor data to be processed; determining dimension folding information of the tensor data to be processed according to the dimension information of the tensor data to be processed; and performing dimension folding on the tensor data to be processed according to the dimension folding information of the tensor data to be processed.
  • the present disclosure provides a method for processing multi-dimensional tensor data, the method being executed by a processor and comprising: obtaining a first expression associated with a first dimension in the multi-dimensional tensor data; obtaining a second expression associated with a second dimension in the multi-dimensional tensor data, wherein the first dimension and the second dimension are adjacent dimensions of the multi-dimensional tensor data; and judging whether the first expression and the second expression are equal, so as to determine whether the first dimension and the second dimension can perform dimension folding.
  • the third aspect of the present disclosure also provides a computer device, including: a processor; a memory for storing a computer-executable program for the processor; wherein the processor is configured to execute the computer-executable program to achieve the above-mentioned any one of the methods described.
  • the fourth aspect of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements any of the methods described above when executed by a processor.
  • the present disclosure provides an integrated circuit device comprising: a processor for performing computational tasks associated with a neural network model; and a memory storing binary program instructions associated with the neural network model and obtained after compilation, wherein the neural network model is optimized according to the methods discussed in the multiple embodiments above and below, and wherein when the binary program instructions are run by the processor, the computational tasks associated with the neural network model are executed.
  • the compilation optimization method of the present disclosure determines the dimension folding information of the tensor (including the dimension folding direction and foldable dimensions, etc.) from the dimension information of the tensor data, and performs dimension folding on the tensor data according to that dimension folding information. In this way, the read and write access of tensor data can be optimized, and the data processing of the tensor data by the target processor during the running phase can be optimized.
  • the compilation optimization method of the present disclosure can reduce the number of loops in the program, reduce the control flow in the program, and improve the execution efficiency of the program.
  • multi-dimensional tensor data can be processed in terms of dimensions, so that dimension folding of multi-dimensional tensor data can be realized.
  • the solution disclosed in the present disclosure can realize the "dimension reduction" operation on multi-dimensional tensor data, so as to facilitate data handling or migration between on-chip and off-chip hardware platforms, and maximize data transmission bandwidth.
  • the utilization rate of the data transmission bandwidth can be significantly improved, and thus the overall performance of the hardware platform during operation can be improved.
  • FIG. 1 is a schematic diagram of a processor for executing a compilation optimization method in an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a compiling optimization method according to an embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of tensor data dimension folding according to an embodiment of the present disclosure
  • FIG. 4 is a schematic diagram of judging whether the dimension of tensor data is continuous through a syntax tree in an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a compilation optimization method according to another embodiment of the present disclosure.
  • FIG. 6 is a structural diagram illustrating a board according to an embodiment of the present disclosure.
  • FIG. 7 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram showing the internal structure of a single-core computing device according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram showing the internal structure of a multi-core computing device according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure.
  • FIG. 11 is a structural diagram showing a software and hardware architecture according to an embodiment of the present disclosure.
  • FIG. 12 is a flowchart illustrating a method for processing multi-dimensional tensor data according to an embodiment of the present disclosure
  • FIG. 13 is a schematic diagram illustrating an embodiment of converting an abstract syntax tree according to the disclosed scheme
  • FIG. 14 is a schematic diagram illustrating yet another embodiment of converting an abstract syntax tree according to the disclosed scheme.
  • FIG. 15 is a schematic diagram illustrating yet another embodiment of converting an abstract syntax tree according to the disclosed scheme
  • FIG. 16 is a schematic diagram showing yet another embodiment of converting an abstract syntax tree according to the disclosed scheme.
  • FIG. 17 is a schematic diagram illustrating yet another embodiment of transforming an abstract syntax tree according to the disclosed scheme
  • FIG. 18 is a schematic diagram illustrating yet another embodiment of converting an abstract syntax tree according to the disclosed scheme.
  • FIG. 19 is a schematic diagram illustrating yet another embodiment of converting an abstract syntax tree according to the disclosed scheme.
  • FIG. 20 is a schematic diagram illustrating yet another embodiment of transforming an abstract syntax tree according to the disclosed scheme.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • the phrase “if determined” or “if [the described condition or event] is detected” may be construed, depending on the context, to mean “once determined” or “in response to the determination” or “once [the described condition or event] is detected” or “in response to detection of [the described condition or event]”.
  • a processor 1-100 includes a plurality of processing units 1-101 and a storage unit 1-102.
  • the plurality of processing units 1-101 are used to execute instruction sequences, and the storage unit 1-102 is used to store data.
  • the storage unit 1-102 may include random access memory (RAM, Random Access Memory) and a register file.
  • the multiple processing units 1-101 in the processor 1-100 can not only share part of the storage space, for example share part of the RAM storage space and the register file, but also have their own storage spaces at the same time.
  • a compiler for executing the compilation and optimization method may run on the processor, and the compiler may be a computer program running on the processor.
  • the compiler is used to convert the received source program into an object code that the target processor can run, for example, an instruction that the target processor can directly run.
  • the compiler of the present disclosure may be a general-purpose compiler, or a neural network compiler capable of compiling neural network calculation graphs, such as TVM, Mindspore, and the like.
  • the target processor may be the above-mentioned general processor, or an artificial intelligence processor (IPU) for performing artificial intelligence calculations.
  • artificial intelligence computing may include machine learning computing, brain-inspired computing, and the like.
  • machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like.
  • the artificial intelligence processor may, for example, include one of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), and a field-programmable gate array (Field-Programmable Gate Array, FPGA) chip, or a combination thereof.
  • the present disclosure is not limited to a specific type of processor.
  • the target processor mentioned in this disclosure may include a plurality of processing units, and each processing unit may independently run various assigned tasks, such as convolution operation tasks, pooling tasks, or fully-connected tasks, etc.
  • the disclosure does not limit the processing unit and the tasks executed by the processing unit.
  • the target processor is a processor capable of parallel computing, and the target processor can process tensor data of granularity such as vector or matrix.
  • the target processor can usually use vectors, matrices, or even larger-grained tensor data for operations.
  • the shape of the three-dimensional feature map can be expressed as HWC, where H represents the height of the tensor data, W represents the width of the tensor data, and C represents the number of channels of the tensor data.
  • the H dimension can be regarded as the highest dimension
  • the C dimension can be regarded as the lowest dimension
  • the W dimension is an intermediate dimension between the highest dimension and the lowest dimension.
  • the tensor data can also be four-dimensional tensor data, which can be expressed as NHWC, where N represents the number of batches (Batch size), and N can be considered as the highest dimension.
  • multi-dimensional loops are usually used.
  • H × W × C loop iterations can be performed, and a vector in one dimension direction is processed each time (that is, multiple data in a certain dimension direction are read).
  • the loop statements introduce more control flow at compile time, which requires multiple data accesses during the running phase and incurs multiple read and write overheads, decreasing the processing efficiency of the target processor for tensor data.
  • the disclosure provides a compilation optimization method, which can determine the dimension folding information of the tensor (including the dimension folding direction and foldable dimensions, etc.) according to the dimension information of the tensor data, perform dimension folding on the tensor data according to the dimension folding information, and generate operation instructions based on the dimensionally folded tensor data (wherein the operation instructions include but are not limited to memory access instructions and arithmetic instructions), so that the read and write access of the tensor data can be optimized, which in turn can optimize the data processing of the tensor data by the target processor during the running phase.
  • the compilation optimization method includes:
  • the tensor data to be processed may be two-dimensional or more than two-dimensional tensor data, and the arrangement of the dimensions may have a certain order.
  • for example, the tensor data to be processed in the present disclosure can be three-dimensional or four-dimensional tensor data.
  • Each dimensional direction may have a dimensional value, indicating the amount of data in the dimensional direction (including but not limited to the number of data, the bit width of data, etc.).
  • the three-dimensional tensor data can be expressed as HWC, where the data volume in the H-dimension direction can be represented by m, the data volume in the W-dimension direction can be represented by k, and the data volume in the C-dimension direction can be represented by n.
  • the dimension information of tensor data to be processed may be determined by the user through the definition of tensor data during the programming process.
  • the user can directly assign a certain dimension value of the tensor data to a constant (such as an immediate value) through the definition method of the tensor data, and the user can also determine the dimension value of the tensor data through the way of variable declaration.
  • the compiler may replace the shape of the tensor data with a constant representation.
  • the compiler can express the shape of the 3D tensor data as 4 × 4 × 4.
  • the compiler in the embodiment of the present disclosure may determine whether adjacent dimensions of the tensor data to be processed are continuous according to the dimension information of the tensor data to be processed, so as to determine dimension folding information of the tensor data to be processed.
  • the dimension folding information of the tensor data to be processed includes, but is not limited to, the direction of dimension folding (such as collapsible dimensions) and the number of collapsible dimensions.
  • the compiler can determine whether the adjacent dimensions of the tensor data are continuous, and if the adjacent dimensions of the tensor data are continuous, it can be determined that the adjacent dimensions can be folded. Since multi-dimensional tensor data is stored in a one-dimensional manner in the order of its dimensions in memory (for example, stored sequentially from the high dimension to the low dimension), there is a certain stride/span between the starting elements at different positions along the same dimension direction of the multi-dimensional tensor data, and that stride/span can be equal to a dimension value in a certain dimension direction.
  • when the stride between the data in the first dimension direction of the multi-dimensional tensor data is equal to the product of the stride between the data in the second dimension direction and the size of the data in the second dimension direction, it can be determined that the adjacent dimensions are continuous, where the first dimension is higher than the second dimension.
  • for example, for the adjacent W and C dimensions this condition can be written as W_stride = C_stride × C_size, where W_stride represents the stride of the W dimension, and C_stride and C_size represent the stride and size of the C dimension respectively, and the W dimension is higher than the C dimension.
  • the present disclosure may first determine whether the second dimension and the third dimension are continuous, and then determine whether the first dimension and the second dimension are continuous in the above manner in the case that the second dimension and the third dimension are continuous.
  • whether adjacent dimensions are continuous is judged sequentially from low dimension to high dimension. In other examples, whether adjacent dimensions are continuous may be judged sequentially from high dimension to low dimension.
  • the present disclosure is described by way of example only and not specifically limited.
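  • As an illustrative sketch (not the disclosed implementation) of the continuity check above, the following Python-style pseudocode tests whether two adjacent dimensions are continuous given their strides and sizes; the function and parameter names are hypothetical:

```python
def is_continuous(stride_hi, stride_lo, size_lo):
    # Adjacent dimensions are treated as continuous (and hence foldable)
    # when the stride of the higher dimension equals the stride of the
    # lower dimension times its size, e.g. W_stride == C_stride * C_size.
    return stride_hi == stride_lo * size_lo

# For a contiguous 4 x 4 x 4 HWC layout (see FIG. 3): C_stride = 1, C_size = 4, W_stride = 4.
assert is_continuous(stride_hi=4, stride_lo=1, size_lo=4)      # W and C are foldable
assert not is_continuous(stride_hi=8, stride_lo=1, size_lo=4)  # padded layout: not foldable
```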
  • the compiler can perform dimension folding processing on the tensor data to be processed according to the dimension folding information.
  • the embodiments of the present disclosure may perform dimensionality reduction on multi-dimensional tensors to obtain tensor data with lower dimensionality but larger dimension values in each remaining dimension.
  • the embodiment of the present disclosure can reduce the above three-dimensional feature map to a two-dimensional tensor; for example, the tensor data after dimensionality reduction can be expressed as (m × k) × n, where the reduced two-dimensional tensor data has one dimension value of (m × k) and another dimension value of n.
  • when processing the tensor data after dimension reduction, only two layers of nested loops are needed instead of three, thereby reducing the number of loops, reducing control flow, and improving processing efficiency.
  • when processing the tensor data after dimension reduction, the loop can be executed n times, processing a data volume of size m × k each time, so that the target processor's memory access overhead and operation overhead for the tensor data can be reduced.
  • the tensor data after dimensionality reduction may be one-dimensional data, and its dimension value may be (m × k × n).
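  • A minimal sketch (Python used purely for illustration; the values of m, k, n and the helper process_vector are hypothetical) of how folding the two higher dimensions removes one level of loop nesting:

```python
m, k, n = 4, 4, 4
data = [[[i * k * n + j * n + c for c in range(n)] for j in range(k)] for i in range(m)]

def process_vector(vec):
    # Hypothetical stand-in for a hardware operation over n elements.
    return sum(vec)

# Before folding: two nested loops over the m and k dimensions, one n-element vector per iteration.
before = [process_vector(data[i][j]) for i in range(m) for j in range(k)]

# After folding m and k into a single dimension of size m * k:
# the same data is processed with one loop level fewer.
folded = [data[i // k][i % k] for i in range(m * k)]
after = [process_vector(folded[i]) for i in range(m * k)]
assert before == after
```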
  • N and H are adjacent dimensions, where N is the first dimension and H is the second dimension, and the first dimension is higher than the second dimension.
  • H dimension and the W dimension are adjacent dimensions, wherein the H dimension is the first dimension and the W dimension is the second dimension.
  • W dimension and the C dimension are adjacent dimensions, wherein the W dimension is the first dimension and the C dimension is the second dimension.
  • the tensor data are further labeled along C0~C3, W0~W3, and H0~H3 (shown as a large cube, in which each small cube represents a data block of the smallest granularity)
  • the tensor data can have the data shape shown in the right part 301 of FIG. 3 (that is, the shape after "dimension reduction"), thereby facilitating subsequent data reading and significantly reducing the nested loop operations in the code implementation process (detailed below).
  • the first data block may correspond to the coordinate H0W0C0
  • the second data block may correspond to the coordinate H0W0C1
  • the fourth data block may correspond to the coordinate H0W0C3, that is, the first to fourth data blocks correspond to the four data blocks arranged along the C dimension in the upper left corner of the tensor data.
  • the 5th data block corresponds to the coordinate H0W1C0
  • the 6th data block corresponds to the coordinate H0W1C1
  • the 7th data block corresponds to the coordinate H0W1C2
  • the 8th data block corresponds to the coordinate H0W1C3.
  • the data of the H dimension is continuous with respect to the folded “WC” dimension.
  • the two-dimensional matrix obtained above is then dimensionally folded, so that a set of tensor data in the form of an array (or shape) as shown at 302 can be obtained, that is, an array of data 1 to data 64 arranged along the H dimension.
  • the data cube is presented in a one-dimensional form by dimension folding of the H dimension and the "WC" dimension.
  • the above tensor data to be processed may be constant data or variable data.
  • when every dimension value of the tensor data to be processed is constant, the tensor data to be processed can be considered constant data; when at least one dimension value of the tensor data to be processed is variable, the tensor data to be processed can be considered variable data.
  • each dimension value of the tensor data to be processed may be determined by the user through assignment or the like.
  • the corresponding dimension value of the tensor data to be processed may be replaced with a constant.
  • the user can specify the dimension value of the tensor data to be processed as a constant by assignment:
  • %c64 and %c128 represent the dimension values of tensor data to be processed respectively.
  • the representation of the tensor data to be processed is:
  • %c = std.view %1[0,0][%c64,%c128] : memref<32768xi8,101> to memref<?x?xi32,101>, where "?" indicates a variable.
  • the representation of the tensor data to be processed can be:
  • when the tensor data to be processed is constant data, it may be determined according to each dimension value of the tensor data to be processed whether its adjacent dimensions are continuous. If adjacent dimensions are continuous, it is determined that the adjacent dimensions can be folded into one dimension. If adjacent dimensions are discontinuous, it is determined that the adjacent dimensions cannot be folded. Further, when the adjacent dimensions of the tensor data to be processed are continuous and the dimension values of those adjacent dimensions are constant, the adjacent dimensions can be folded, and the tensor data to be processed can be converted into a representation based on the folded tensor dimension, wherein the dimension value of the folded tensor is the product of the dimension values of the consecutive adjacent dimensions.
  • the conversion of the representation form of the tensor data to be processed can be realized by the following folding statement:
  • for variable data, since at least one dimension value of the tensor data to be processed is a variable (the variable cannot be determined at compile time; during compilation the variable can be represented by an expression, and its specific value is only determined at runtime), it is impossible to directly determine whether adjacent dimensions are continuous from the dimension values. In this case, the compiler can use the expressions of the parameters related to the adjacent dimensions of the tensor data to be processed to determine whether those adjacent dimensions are continuous.
  • the expressions of the above-mentioned parameters related to adjacent dimensions can be converted into abstract syntax trees. Based on the static single assignment (SSA) form, the compiler can respectively determine the abstract syntax tree for the high-dimension stride W_stride and the abstract syntax tree for the low-dimension product C_stride × C_size, and simplify and transform the two syntax trees so that they have the same tree structure (for example, the root nodes of both syntax trees are addition operations, and the leaf nodes can be multiplication operations).
  • the embodiment of the present disclosure can move the root node or child nodes corresponding to multiplication operations toward the lower layers of the abstract syntax tree, and move the child nodes and/or leaf nodes corresponding to addition operations toward the upper layers of the abstract syntax tree, so as to obtain syntax trees with the same tree structure.
  • the compiler can determine whether the adjacent dimensions are continuous by judging whether the two syntax trees are the same, so as to determine whether the adjacent dimensions can be folded.
  • FIG. 4 shows two syntax trees determined according to expressions, and two syntax trees are obtained after transformation.
  • the upper part of FIG. 4 is the first abstract syntax tree of the present disclosure, which may be constructed by tracing back along the intermediate expression of the operation the calculation process that yields the stride of the first dimension (such as the aforementioned W dimension).
  • the lower part of FIG. 4 is the second abstract syntax tree of the present disclosure, which may be constructed by tracing back along the intermediate expression of the operation the calculation process that yields the stride and size of the second dimension (such as the aforementioned C dimension).
  • the first abstract syntax tree can represent the expression "(a × b) + (c × d) + (e × f × g) + (-h)" (which, for example, is equivalent to W_stride), while the second abstract syntax tree can represent the expression "(-h) + (b × a) + (c × d) + (e × f × g)" (which is equivalent to "C_stride × C_size"). Since there are still differences between the first abstract syntax tree and the second abstract syntax tree at the second layer and the third layer, in an implementation scenario the embodiment of the present disclosure may also use a fingerprint-based hash operation to determine whether all child nodes under the first root node and all child nodes under the second root node are equivalent to each other.
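  • The following is a minimal sketch, not the disclosed implementation, of how such an equivalence check with a commutativity-insensitive fingerprint hash might look; the Node representation and function names are assumptions introduced for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    op: Optional[str] = None          # "+" or "*" for internal nodes, None for leaves
    value: Optional[str] = None       # SSA value / symbol name for leaves
    children: List["Node"] = field(default_factory=list)

def fingerprint(node: Node) -> int:
    if node.op is None:
        return hash(("leaf", node.value))
    # Sort child fingerprints so commutative operands hash identically
    # regardless of their order under the same parent node.
    child_fps = tuple(sorted(fingerprint(c) for c in node.children))
    return hash((node.op, child_fps))

def trees_equivalent(a: Node, b: Node) -> bool:
    # Equal fingerprints -> the two expressions are treated as equal,
    # hence the corresponding adjacent dimensions are considered foldable.
    return fingerprint(a) == fingerprint(b)

# (a*b) + (c*d) compared against (d*c) + (b*a): equivalent under commutativity.
t1 = Node("+", children=[Node("*", children=[Node(value="a"), Node(value="b")]),
                         Node("*", children=[Node(value="c"), Node(value="d")])])
t2 = Node("+", children=[Node("*", children=[Node(value="d"), Node(value="c")]),
                         Node("*", children=[Node(value="b"), Node(value="a")])])
assert trees_equivalent(t1, t2)
```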
  • the compiler can also obtain the dimension folding index of the tensor data to be processed according to the foldable dimensions, and the dimension folding index is used to indicate the index of the largest contiguous dimension of the tensor data. Therefore, the compiler can obtain dimension folding information such as which dimensions can be folded and the number of dimensions that can be folded (the dimension folding index), so as to perform dimension folding on the tensor data to be processed according to the dimension folding information.
  • the tensor data to be processed is HWC, and its total dimensionality is 3. If it is determined that the W dimension and the C dimension are continuous, it can be determined that the W dimension and the C dimension can be folded, and it can be determined that the dimension folding index value of the tensor data to be processed is 2. If the H dimension, W dimension, and C dimension are all continuous, it can be determined that the H dimension, W dimension, and C dimension can all be folded. At this time, the dimension folding index value of the tensor data to be processed is 3, and the dimension folding index value is equal to its total number of dimensions.
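  • A minimal sketch (assuming dimensions ordered from highest to lowest and folding proceeding upward from the lowest dimension; the function name and the strides/sizes representation are hypothetical) of deriving the dimension folding index from the pairwise continuity check:

```python
def folding_index(strides, sizes):
    """strides/sizes are ordered from highest to lowest dimension, e.g. H, W, C.
    Returns the number of trailing (lowest) dimensions that form one contiguous
    block and can therefore be folded together (counting the lowest dimension
    alone as 1 is an assumption of this sketch)."""
    index = 1
    # Walk upward from the lowest dimension, extending the contiguous block
    # while the higher dimension's stride equals stride_lo * size_lo.
    for d in range(len(strides) - 1, 0, -1):
        if strides[d - 1] == strides[d] * sizes[d]:
            index += 1
        else:
            break
    return index

# HWC tensor where only W and C are contiguous: dimension folding index 2.
assert folding_index(strides=[100, 4, 1], sizes=[4, 4, 4]) == 2
# Fully contiguous 4x4x4 HWC tensor: dimension folding index 3 (equal to the total number of dimensions).
assert folding_index(strides=[16, 4, 1], sizes=[4, 4, 4]) == 3
```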
  • the tensor data to be processed in the present disclosure may be a data block participating in operation and/or memory access operation.
  • when the processor executes a task, it may process multiple types of information, for example, the tensor data involved in the task and at least one operation required to be performed by the task.
  • the task may include any one of an image processing task, a video processing task, a text processing task, and a voice processing task, and the present disclosure does not limit the types of preset tasks.
  • each operand of the operation operation may be a tensor data to be processed.
  • the compilation optimization method may include:
  • step S510 dimension information of at least one tensor data to be processed in the operation is acquired, wherein the dimension information includes a dimension value of at least one dimension of the tensor data.
  • the above operations include arithmetic operations and memory access operations, etc.
  • the arithmetic operations include but not limited to multiplication operations, addition operations, convolution operations, activation operations and other operations.
  • Each operation can operate on specific tensor data, and each operation requires at least one tensor data to be processed.
  • the operation can be compiled into a corresponding operation instruction, and the target processor can execute the operation instruction to realize the corresponding operation.
  • each data participating in the operation can be tensor data.
  • the tensor data to be processed may be one-dimensional tensor, that is, vector data; the tensor data may also be two-dimensional tensor, that is, matrix data; and the tensor data may also be multi-dimensional tensor.
  • the tensor data may include dimension values of at least one dimension, at least one dimension value is used to describe the shape and size of the tensor data, and the product of at least one dimension value is the size of the tensor data.
  • the dimension values of the tensor data may include a row dimension value and a column dimension value, and the tensor data may be expressed as row dimension value × column dimension value, for example, 64 × 128.
  • the dimension values of the tensor data can include a length value, a width value, and a height value, and the tensor data can be expressed as length value × width value × height value, for example, 64 × 128 × 32.
  • step S520 the dimension folding information of each tensor data to be processed is determined according to the dimension information of each tensor data to be processed.
  • for each tensor data to be processed in the operation, it can be determined whether its adjacent dimensions are continuous according to the method of the above-mentioned embodiment, so as to obtain the dimension folding information of each tensor data to be processed.
  • the specific method for determining the dimension folding information of each tensor data to be processed can be referred to above, and will not be repeated here.
  • the target dimension folding information is determined according to the dimension folding information of all tensor data to be processed in the operation.
  • the target folding information may include the target collapsible dimensions of each tensor data to be processed in the operation and the number of target collapsible dimensions (that is, the target dimension folding index).
  • the folding manner of all the tensor data to be processed involved in the operation should be the same, that is, the dimension folding direction required by each tensor data to be processed is consistent, and the dimension folding index values of each tensor data to be processed are the same.
  • the method for determining the above target dimension folding information may include:
  • the target collapsible dimension may be an intersection of the collapsible dimensions of the at least one operand
  • based on the target collapsible dimension and the dimension folding indices of the at least one operand, determine the target dimension folding index, wherein the target dimension folding index may be determined from the dimension folding indices of all tensor data to be processed in the operation in the direction of the target collapsible dimension.
  • the above-mentioned operation is an addition operation, and the addition operation involves two operands (namely, an addend and an augend), both of which are three-dimensional tensor data.
  • the addend can be expressed as H1W1C1
  • the dimension folding information of the addend is: it can be folded in the H dimension, W dimension, and C dimension, and the dimension folding index value is 3.
  • the augend can be expressed as H2W2C2.
  • the dimension folding information of the augend is: the W dimension and the C dimension can be folded, and the dimension folding index value is 2.
  • the compiler can determine that the target folding information is: the W dimension and the C dimension can be folded, and the dimension folding index value is 2.
  • the dimension folding information of the addend is: it can be folded in the H dimension and the W dimension, and the dimension folding index value is 2.
  • the dimension folding information of the augend is: the W dimension and the C dimension can be folded, and the dimension folding index value is 2. Then the compiler can determine that the target folding information is: none of the dimensions can be folded, and the dimension folding index value is 0.
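  • As an illustrative sketch (not the disclosed algorithm) consistent with the two addition examples above, the target folding information might be derived by intersecting the per-operand foldable dimensions and keeping only the run that is contiguous from the lowest dimension; the data structures and names here are assumptions:

```python
def target_folding_info(per_operand_foldable, dim_order):
    """per_operand_foldable: list of sets of foldable dimension names, one set per operand.
    dim_order: dimension names from highest to lowest, e.g. ["H", "W", "C"].
    Returns (target foldable dims, target dimension folding index); the target dims are
    restricted to a run contiguous from the lowest dimension so every operand folds the same way."""
    common = set.intersection(*per_operand_foldable) if per_operand_foldable else set()
    target = []
    for dim in reversed(dim_order):          # walk from the lowest dimension upward
        if dim in common:
            target.append(dim)
        else:
            break
    index = len(target) if len(target) >= 2 else 0   # fewer than 2 dims: nothing to fold
    return (set(target) if index else set(), index)

# Addend foldable in H, W, C (index 3); augend foldable in W, C (index 2): target is W, C with index 2.
assert target_folding_info([{"H", "W", "C"}, {"W", "C"}], ["H", "W", "C"]) == ({"W", "C"}, 2)
# Addend foldable in H, W; augend foldable in W, C: no common contiguous run, so index 0.
assert target_folding_info([{"H", "W"}, {"W", "C"}], ["H", "W", "C"]) == (set(), 0)
```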
  • step S540 according to the target dimension folding information, perform dimension folding on all tensor data to be processed in the operation.
  • the compiler can perform consistent dimension folding operations on all tensor data to be processed in the operation according to the collapsible dimensions and the target dimension folding index indicated in the target dimension folding information, so as to ensure correct execution of the operation.
  • the compilation optimization method of the embodiments of the present disclosure can perform dimensionality reduction processing on the tensor data according to its dimension information through the compiler during compilation, obtaining tensor data with lower dimensionality and larger dimension values, which reduces the number of loop iterations over the tensor data and reduces the control flow, so the bandwidth of the processor and memory can be fully utilized at runtime, more data can be read at one time, and the processing efficiency of tensor data is improved.
  • the above method can also perform corresponding operations according to the tensor data after dimensionality reduction processing.
  • the above method may further include: in step S550, after dimensionally folding the operands in the operation, compiling and generating the operation to obtain an instruction corresponding to the operation.
  • the instruction may be a hardware instruction executable by the target processor.
  • by executing the instruction, the functions of the above operations can be realized, including but not limited to computing functions and memory access functions (such as data copy functions, etc.).
  • this operation is an addition operation, which may include three operands %a, %b, and %c, all of which are tensor data; after dimension folding the three operands are %a', %b', and %c' respectively.
  • %a' = mlu.memrefcast %a : memref<64x128xi32,101> to memref<8192xi32,101>;
  • %b' = mlu.memrefcast %b : memref<64x128xi32,101> to memref<8192xi32,101>;
  • %c' = mlu.memrefcast %c : memref<64x128xi32,101> to memref<8192xi32,101>;
  • after folding, the dimension value of each tensor data is 8192, and each tensor data is computed and accessed at a granularity of 8192, so that when the target processor executes the hardware instructions generated on this basis, data memory access efficiency and operation efficiency can be improved.
  • the expression form of the result data after the above operation may be consistent with the expression form of the folded tensor data.
  • the dimension of the resulting data may be different from the dimension of the initial tensor data to be processed.
  • the representation of the folded tensor data can be expanded into the initial representation.
  • the representation of each operand in the addition operation is 8192
  • the representation of the result data obtained by the addition operation is also 8192.
  • the result data of the addition operation can be expanded to 64 × 128.
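  • To illustrate the fold/compute/expand flow described above (NumPy is used here purely for illustration; the actual transformation is performed on compiler IR such as the mlu.memrefcast statements shown above, not at the Python level):

```python
import numpy as np

a = np.arange(64 * 128, dtype=np.int32).reshape(64, 128)
b = np.ones((64, 128), dtype=np.int32)

# Dimension folding: view both operands as one-dimensional tensors of size 8192,
# mirroring memref<64x128xi32> -> memref<8192xi32>.
a_flat = a.reshape(8192)
b_flat = b.reshape(8192)

# The addition now runs at a granularity of 8192 elements in a single pass.
c_flat = a_flat + b_flat

# Dimension expansion: the result is restored to the initial 64 x 128 representation.
c = c_flat.reshape(64, 128)
assert c.shape == (64, 128)
```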
  • the present disclosure also provides a computer device.
  • the computer device may include a processor and a memory, and a computer executable program is stored on the memory.
  • when the processor executes the above computer-executable program, the method of any of the above embodiments is implemented.
  • a compiler may run on the processor, and the compiler may implement the method in any of the foregoing embodiments.
  • the compiler is used to: obtain dimension information of the tensor data to be processed, wherein the dimension information includes a dimension value of at least one dimension of the tensor data to be processed; according to the dimension information of the tensor data to be processed , determining dimension folding information of the tensor data to be processed; performing dimension folding on the tensor data to be processed according to the dimension folding information of the tensor data to be processed.
  • the dimension folding information includes collapsible dimensions; the compiler is specifically configured to: determine whether adjacent dimensions of the tensor data to be processed are continuous according to the dimension information of the tensor data to be processed; and when the adjacent dimensions of the tensor data to be processed are continuous, determine that the adjacent dimensions are collapsible dimensions.
  • the adjacent dimensions of the tensor data to be processed include the first dimension and the second dimension
  • the compiler is further configured to: when the stride between the data in the direction of the first dimension is equal to the product of the stride between the data in the direction of the second dimension and the size of the data in the direction of the second dimension, determine that the adjacent dimensions of the tensor data to be processed are continuous.
  • the dimension folding information further includes a dimension folding index; the compiler is further configured to: determine the collapsible dimension index according to the number of collapsible dimensions in the tensor data to be processed.
  • the compiler is further configured to: when the dimension value of the tensor data to be processed is constant, then determine the folded tensor dimension according to the dimension folding information of the tensor data to be processed, so as to implement Dimension reduction processing of the tensor data to be processed; wherein, the folded tensor dimension is equal to the product of the dimension values of the foldable dimensions of the tensor data to be processed.
  • the compiler is further configured to: if the dimension value of the tensor data to be processed is a constant, then replace the corresponding dimension value with the constant.
  • the tensor data to be processed are operands in an operation; the compiler is further configured to: determine target dimension folding information according to the dimension folding information corresponding to at least one operand in the operation; the target dimension folding information is used to indicate a dimension folding manner of the at least one operand in the operation.
  • the target dimension folding information includes a target collapsible dimension and a target dimension folding index; the compiler is further configured to: determine the target collapsible dimension according to the collapsible dimensions of the at least one operand, wherein the target collapsible dimension is an intersection of the collapsible dimensions of the at least one operand; and determine the target dimension folding index based on the target collapsible dimension and the dimension folding indices of the at least one operand.
  • the compiler is further configured to: perform dimension folding on all operands in the operation according to the target dimension folding information.
  • the compiler is further configured to: after dimensionally folding the operands in the operation, compile the operation to obtain the instruction corresponding to the operation, so that the target processor can, according to the instruction, implement the function of the operation and obtain the result data.
  • the compiler is further configured to: perform dimension expansion on the result data according to the target folding information, so that the expression form of the result data is consistent with the initial expression form of the operand.
  • the present disclosure also provides a method for processing the tensor data to be processed (ie, multi-dimensional tensor data) and related products.
  • FIG. 12 is a flowchart illustrating a method 700 for processing multi-dimensional tensor data according to an embodiment of the present disclosure.
  • the method 700 of the present disclosure can be applied to a scenario where operations are performed using multi-dimensional tensor data.
  • the method 700 can be applied to the processing of multi-dimensional tensor data in the field of artificial intelligence (such as deep machine learning).
  • the method 700 here can be executed by a processor, for example, at the stage of compiling the program code for the neural network model, that is, by using the compiler 603 in FIG. 12 .
  • the determination, described above, of whether the adjacent dimensions of the tensor data to be processed are continuous according to the syntax trees of the parameters related to the adjacent dimensions may include the method shown in FIG. 12.
  • a first expression associated with a first dimension in the multidimensional tensor data is obtained.
  • the multi-dimensional tensor data here can be two-dimensional or more than two-dimensional tensor data, and the arrangement of dimensions can have a certain order.
  • the multi-dimensional tensor data herein may be three-dimensional or four-dimensional tensor data.
  • the tensor data may have a data format of NHWC, where N represents the number of batches (Batch size), H represents the height of the tensor data, W represents the width of the tensor data, and C represents the number of channels of tensor data, such as the data cube shown in Figure 3.
  • N dimension can be considered as the highest dimension
  • C dimension can be considered as the lowest dimension
  • H and W dimensions are intermediate dimensions between the highest and lowest dimensions
  • the H dimension is higher than the W dimension.
  • a second expression associated with a second dimension in the multidimensional tensor data is obtained.
  • the second dimension here and the above-mentioned first dimension are adjacent dimensions.
  • N and H are adjacent dimensions, where N is the first dimension and H is the second dimension.
  • the H dimension and the W dimension are adjacent dimensions, wherein the H dimension is the first dimension and the W dimension is the second dimension.
  • the W dimension and the C dimension are adjacent dimensions, wherein the W dimension is the first dimension and the C dimension is the second dimension.
  • the above-mentioned first expression may be an expression constructed based on the stride ("stride") of the first dimension of the multidimensional tensor data.
  • the above second expression may be an expression constructed based on the stride ("stride") and size ("size") of the second dimension of the multidimensional tensor data.
  • step S706 it is judged whether the first expression and the second expression are equal, so as to determine whether the first dimension and the second dimension can perform dimension folding.
  • the first dimension is the W dimension and the second dimension is the C dimension
  • the first expression is based on the span of the W dimension
  • the second expression is based on the span and size of the C dimension
  • the equation: W_stride = C_stride × C_size, wherein W_stride represents the span of the W dimension, and C_stride and C_size represent the span and size of the C dimension, respectively.
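  • As a worked illustration (assuming the contiguous 4 × 4 × 4 HWC data cube of FIG. 3 discussed below): C_stride = 1, C_size = 4, and W_stride = 4, so W_stride = C_stride × C_size holds (4 = 1 × 4) and the W and C dimensions can be folded; the folded "WC" dimension then has span 1 and size 16, and since H_stride = 16 = 1 × 16, the H dimension can be folded with it as well.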
  • the tensor data can have the data shape shown in the right part 301 of FIG. 3 (that is, the shape after "dimension reduction"), thereby facilitating subsequent data reading and significantly reducing the nested loop operations in the code implementation process (detailed below).
  • the first data block may correspond to the coordinate H0W0C0
  • the second data block may correspond to the coordinate H0W0C1
  • the fourth data block may correspond to the coordinate H0W0C3, that is, the first to fourth data blocks correspond to the four data blocks arranged along the C dimension in the upper left corner of the tensor data.
  • the 5th data block corresponds to the coordinate H0W1C0
  • the 6th data block corresponds to the coordinate H0W1C1
  • the 7th data block corresponds to the coordinate H0W1C2
  • the 8th data block corresponds to the coordinate H0W1C3.
  • for the HWC three-dimensional data of FIG. 3, when performing data copying, from the perspective of code implementation it is common practice to execute multiple nested loops to read all the data, where the size of the H dimension determines the number of outer-loop iterations and the size of the W dimension determines the number of inner-loop iterations, constituting two layers of nested loops, and the size of the C dimension determines the number of data blocks read in each loop iteration.
  • the solution of the present disclosure can generate operation instructions for the tensor data according to the dimensionally folded tensor data
  • the operation instructions include but not limited to memory access instructions and operation instructions
  • the operation instructions can be instructions that the above-mentioned chip or board is able to execute.
  • the granularity of data reading and calculation after dimension folding in the present disclosure is increased (from 4 to 16), and when the chip or board executes the input/output ("I/O") memory access instructions generated from the code optimized based on dimension folding, the number of I/O memory accesses can be significantly reduced, thereby greatly reducing the I/O overhead.
  • the chip or board executes the operation instructions generated based on the dimensionally folded and optimized code, which can significantly improve the operation efficiency of the hardware.
  • the tensor cube in the figure can be "dimension-reduced" into a two-dimensional matrix as shown at 301, so as to speed up data reading and reduce the nested loop operations described above, wherein the height of the matrix is still formed by the H dimension and the width is formed by the new dimension obtained by folding the C and W dimensions. According to the folding scheme of the present disclosure, it can still be determined whether to perform the folding operation on the H dimension and the new dimension.
  • since the span of the H dimension is 16, and the span of the new dimension is 1 and its size is 16, the equation requirement set by the present disclosure is still satisfied, that is, the data of the H dimension is continuous with respect to the folded "WC" dimension.
  • the two-dimensional matrix obtained above is then dimensionally folded, so that a set of tensor data in the form of an array (or shape) as shown at 302 can be obtained, that is, an array of data 1 to data 64 arranged along the H dimension.
  • the data cube is presented in a one-dimensional form through the dimension folding of the H dimension and the "WC" dimension, and a data cube in Figure 3 can be read as a whole when copying data, that is, no nested loop operation is required.
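  • A minimal sketch (illustrative only, not the generated device code) contrasting the nested-loop copy of the original HWC cube with the single whole-cube read that becomes possible after folding all three dimensions:

```python
H, W, C = 4, 4, 4
src = list(range(H * W * C))   # the 64 data blocks of the FIG. 3 cube, stored contiguously
dst_nested, dst_flat = [], []

# Before folding: two nested loops, reading C data blocks per inner iteration.
for h in range(H):
    for w in range(W):
        base = h * W * C + w * C
        dst_nested.extend(src[base:base + C])

# After folding H, W and C into one dimension of size 64: a single whole-cube read.
dst_flat.extend(src[0:H * W * C])

assert dst_nested == dst_flat == src
```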
  • the code can be further optimized to reduce the overhead of control flow.
  • after multi-dimensional tensor data is folded to reduce its dimensions, various operations between tensor data also become more convenient. For example, for the addition operation between two tensor data, the addition after dimensionality reduction will be more convenient and faster.
  • the tensor dimension folding in the disclosed scheme is not limited to the two-dimensional folding scheme.
  • the folding operation can also be performed on a higher dimension. For example, after determining that the W dimension and the C dimension in FIG. 3 are continuous and folding them, it can be further determined whether the folded (WC) dimension and the H dimension are continuous and can be folded.
  • the method for judging whether the folded (WC) dimension and the H dimension are continuous is the same as the method for judging the above-mentioned W dimension and the C dimension, and will not be repeated here.
  • the method shown in Fig. 12 can be applied to the intermediate representation generated by compiling the neural network model, where the compiling operation can be executed by a compiler.
  • a compiler can generally be divided into a front end and a back end. During the compilation process, the front end performs various types of analysis on the input program code, such as lexical analysis, syntax analysis, or semantic analysis, and then generates an intermediate representation IR (Intermediate Representation) in the form of a data structure.
  • the compiler here can be a neural network compiler, such as "TVM”.
  • the neural network compiler can receive a neural network model from a neural network programming framework (such as Tensorflow, pytorch, or caffe, etc.). Then, the neural network is parsed and reconstructed by the compilation front-end to obtain the calculation graph. Then, the optimizer in the compiler can be used to perform optimization operations such as fusion (such as the fusion of multiple operators) or pruning (such as removing edges and operators that are not related to the final output node in the calculation graph) on the generated calculation graph.
  • an intermediate representation in the form of a graph is obtained, that is, the aforementioned "IR”.
  • the graphical intermediate representation can be converted into an intermediate representation compatible with the hardware.
  • the code executable on the hardware can be generated according to the intermediate representation adapted to the hardware, that is, the binary instruction sequence that can be executed by, for example, the AI processor described in detail in conjunction with FIG. 11.
  • the method 700 can also obtain the first abstract syntax tree constructed according to the first expression and the second abstract syntax tree constructed according to the second expression.
  • an abstract syntax tree includes a plurality of nodes, which can be roughly divided into a root node, located at the uppermost level of the abstract syntax tree; leaf nodes, located at the lowest level of the abstract syntax tree; and child nodes, located in the intermediate layers between the root node and the leaf nodes. Based on this, the process of constructing the abstract syntax tree is to first construct the root node of the abstract syntax tree, and then construct each child node (if necessary) downward to the leaf nodes.
  • the scheme of the present disclosure can convert the problem of judging whether the first expression and the second expression are equal into judging whether the first abstract syntax tree and the second abstract syntax tree are equal (or equivalent).
  • when the first abstract syntax tree and the second abstract syntax tree are equal, it can be determined that the first expression is equal to the second expression, and thus it can be determined that the aforementioned first dimension and second dimension can perform dimension folding.
  • when the first abstract syntax tree and the second abstract syntax tree are not equal, it can be determined that the first expression is not equal to the second expression, and thus it can be determined that the first dimension and the second dimension cannot perform dimension folding.
  • when the dimension folding operation of the present application is performed at the compilation stage, multiple nested loop operations can be reduced, thereby reducing the overhead of control flow. Furthermore, due to the dimensionality reduction achieved by dimension folding, the I/O instructions generated for the reduced-dimension tensor data will significantly improve memory access efficiency when performing data access, thereby reducing the I/O overhead.
  • with respect to obtaining the first abstract syntax tree, the present disclosure proposes to trace back, along the intermediate expression (which is also an expression), the calculation process that obtains the span of the first dimension.
  • the root node of the first abstract syntax tree and the leaf nodes under the aforementioned root node or the child nodes and leaf nodes under the root node may be constructed based on the calculation process.
  • obtaining the second abstract syntax tree may include tracing forward the calculation process of obtaining the span and size of the second dimension along the intermediate expression of the neural network model.
  • the root node and the leaf nodes under the root node or the child nodes and leaf nodes under the root node of the second abstract syntax tree may be constructed based on the calculation process.
  • each node in the abstract syntax tree can be used to represent a constant or a variable in the program code.
  • the aforementioned variables may be of the "Static Single Assignment" (SSA) type.
  • the root node and child nodes may represent their respective associated SSA values and the operations to be performed to obtain those SSA values, and the leaf nodes represent the SSA values at which the trace-back stops.
  • the child nodes may further represent operands required to perform operations.
  • the constant or variable in the syntax tree may have a preset coefficient ("coefficient"), which represents the coefficient by which the current node needs to be multiplied when it participates in the calculation of its parent node.
  • the present disclosure proposes to stop the tracing process when some situations occur. For example, when tracing along an intermediate expression to an entry parameter of a kernel function ("kernel") or when tracing an intermediate expression to a specific operation.
  • kernel function can represent the calculation process of the neural network model or the calculation process of the operator (such as the convolution operator), and the entry parameter can be the input parameter required by the calculation process.
  • the construction of the abstract syntax tree is completed. Thereafter, the obtained first abstract syntax tree and the second abstract syntax tree may be compared to determine whether they are the same.
  • Since the first abstract syntax tree or the second abstract syntax tree can be a hierarchical structure composed of nodes, judging whether the first abstract syntax tree and the second abstract syntax tree are the same can be performed layer by layer. Additionally or alternatively, it may also be determined whether the first abstract syntax tree and the second abstract syntax tree are the same by judging whether a first predetermined number of layers are the same.
  • To this end, the present disclosure proposes to compare and judge only the first several layers (such as the first two or three layers) of the two abstract syntax trees, thereby simplifying the comparison of the two abstract syntax trees.
  • In one embodiment, judging whether the first abstract syntax tree and the second abstract syntax tree are the same may include judging whether the first root node of the first abstract syntax tree and the second root node of the second abstract syntax tree have the same operation type. When both have the same operation type, it is further necessary to determine whether all child nodes and leaf nodes under the first root node and all child nodes and leaf nodes under the second root node are the same or equivalent to each other.
  • Since the abstract syntax tree can have a hierarchical structure, when comparing the first abstract syntax tree and the second abstract syntax tree, it can be judged layer by layer whether each child node, down to the leaf nodes at the lowest level, is the same or equivalent.
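A simplified recursive comparison consistent with this description might look as follows; it reuses the hypothetical AstNode sketch above and canonicalizes child order by sorting on node attributes, which only approximates the fingerprint-based equivalence test described further below:

```python
def same_tree(x, y, max_depth=None):
    """Compare two abstract syntax trees layer by layer (optionally only the first layers)."""
    if (x.op, x.coefficient, x.value, x.ssa_name) != (y.op, y.coefficient, y.value, y.ssa_name):
        return False
    if max_depth is not None and max_depth <= 1:
        return True                                 # only the first max_depth layers are checked
    if len(x.children) != len(y.children):
        return False
    key = lambda n: (str(n.op), str(n.ssa_name), str(n.value), n.coefficient)
    next_depth = None if max_depth is None else max_depth - 1
    return all(same_tree(cx, cy, next_depth)
               for cx, cy in zip(sorted(x.children, key=key), sorted(y.children, key=key)))
```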
  • When the comparison succeeds, the first abstract syntax tree and the second abstract syntax tree are the same abstract syntax tree, and it is thus determined that the first dimension and the second dimension can be dimension-folded, that is, the data on the first dimension and the second dimension are placed on the same dimension.
  • Otherwise, the first expression and the second expression are different, so it is determined that the first dimension and the second dimension cannot be dimension-folded.
  • In that case, the tensor data will also not be able to be expressed, read, written, and operated on in a "dimension-reduced" form.
  • an abstract syntax tree in its simplest form usually has at most three levels arranged from top to bottom, namely the root level-the middle level (including child nodes representing addition or multiplication operations)-leaf node level.
  • In some cases, the converted abstract syntax tree has only two levels, i.e., there is no child node in the middle level. It can be understood that the following conversion operations of the present disclosure are only exemplary and optional, and the folding operations of the present disclosure are not limited thereto. For example, for a constructed abstract syntax tree that already has the aforementioned simplest form, the following conversion operations may not be performed. For another example, for two abstract syntax trees that cannot be compared after being constructed (such as trees with different numbers of layers that cannot be simplified by the following conversion operations), the present disclosure can directly determine that the first expression and the second expression are not the same. Therefore, it can be directly judged that the first dimension and the second dimension are not foldable, that is, they cannot be folded and read continuously as mentioned above, so loop-nested reading is still required. Different embodiments of converting the obtained abstract syntax trees are described below with reference to FIGS. 13-20.
  • Fig. 13 shows a scenario in which, in the process of converting the abstract syntax tree, a subtraction operation is converted into an addition operation, multiplication operations are moved down, and addition operations are moved up.
  • To convert a subtraction into an addition, the coefficient of the corresponding node can be set to -1 (the default is 1), where the coefficient represents the value to be multiplied in when the current node participates in the calculation of its parent node. In this case, only addition and multiplication operations remain in the abstract syntax tree, which is beneficial to the subsequent judgment operations.
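A sketch of this rewrite under the same hypothetical AstNode structure (it assumes binary subtraction nodes): a "sub" node becomes an "add" node whose second child carries an extra factor of -1 in its coefficient, so that only additions and multiplications remain:

```python
def lower_sub_to_add(node):
    """Rewrite subtraction nodes as additions with a -1 coefficient on the subtrahend."""
    node.children = [lower_sub_to_add(c) for c in node.children]
    if node.op == "sub":                 # assumes a binary subtraction node
        node.op = "add"
        node.children[1].coefficient *= -1
    return node
```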
  • Another conversion shown in Fig. 13 is to move the root node or child nodes corresponding to multiplication operations toward the lower layers of the abstract syntax tree (such as the first abstract syntax tree and/or the second abstract syntax tree of the present disclosure), and to move the child nodes and/or leaf nodes corresponding to addition operations toward the upper layers of the abstract syntax tree.
  • FIG. 14 and FIG. 15 show the scene of merging parent nodes and child nodes with the same operation type during the process of converting the abstract syntax tree.
  • the abstract syntax tree has a root node representing the addition operation and two child nodes performing the addition operation.
  • the root node and the two child nodes can be merged to obtain the abstract syntax tree shown in the right part of Fig. 14 .
  • the conversion operation of the example shown in the figure is to convert the expression "(a+b)+(c+d)" into the expression "a+b+c+d".
  • In the left part of Fig. 15, the abstract syntax tree has a root node representing an addition operation and two child nodes representing an addition operation and a multiplication operation, respectively. Similar to Figure 14, since the root node and the left child node represent the same addition operation, the left child node and the root node can be merged so as to move the variables a and b represented by the leaf nodes to the upper layer; the abstract syntax tree shown in the right part of FIG. 15 is thus obtained. From the expression represented by the abstract syntax tree, the conversion operation of the example shown in the figure is to convert the expression "(a+b)+(c×d)" into the expression "a+b+(c×d)".
  • the converted abstract syntax tree usually has at most three layers arranged from top to bottom, that is, tree root layer-middle layer (including child nodes representing addition or multiplication operations)-leaf node layer.
  • the converted abstract syntax tree has only two levels, ie, there is no child node in the middle level.
  • the root node in the abstract syntax tree is directly connected to the leaf nodes.
  • In other words, the conversion operation shown in Figure 14 and Figure 15 is to determine whether there are a root node and child nodes with the same operation type in the abstract syntax tree and, in response to the existence of a root node and child nodes with the same operation type, to merge the root node and those child nodes.
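A sketch of this merge rule, again on the hypothetical AstNode structure: children that represent the same associative operation as their parent (and carry a plain coefficient of 1) are spliced into the parent, e.g. (a+b)+(c+d) becomes a+b+c+d:

```python
def flatten_same_op(node):
    """Merge child nodes into their parent when both represent the same associative operation."""
    node.children = [flatten_same_op(c) for c in node.children]
    merged = []
    for child in node.children:
        # Only merge when the child is a plain sub-expression (coefficient 1) of the same op.
        if child.op == node.op and child.coefficient == 1:
            merged.extend(child.children)
        else:
            merged.append(child)
    node.children = merged
    return node
```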
  • The present disclosure also proposes to perform a reduction operation on the abstract syntax tree, specifically a reduction of like terms.
  • Here, like terms may be nodes in the abstract syntax tree representing an immediate value, an immediate value multiplied by a variable, and/or a plain variable.
  • Figure 16 shows the reduction operation on nodes representing immediate values: the leaf nodes representing 7, 4 and 2 in the abstract syntax tree on the left of Figure 16 are reduced to a single node representing the immediate value 13 in the abstract syntax tree on the right of Figure 16, thereby simplifying the abstract syntax tree.
  • Figure 17 shows a reduce operation on a node representing immediate multiplication by a variable.
  • Fig. 18 shows the reduction of nodes representing variables: the nodes representing a, b and -a in the abstract syntax tree on the left of Fig. 18 are merged, so as to obtain an abstract syntax tree in which the leaf node representing "b" is directly connected to the root node.
  • In order to determine like terms, a feature-based hash value can be generated for the nodes in the abstract syntax tree, and the nodes can be sorted based on the generated hash values. When the same hash value appears, the corresponding nodes can be considered like terms, so that the abstract syntax tree can be simplified by merging them. Taking Figure 17 as an example, when the feature-based hash operation is performed on its nodes, the two terms "3a" and "a" have the same hash value because coefficients are not considered. It is therefore judged that the two nodes are like terms that can be merged, so as to obtain a node representing "4a".
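A sketch of such a like-term reduction on the hypothetical AstNode structure; the feature key here only inspects a node's own attributes (ignoring coefficients), which is a simplification of the subtree fingerprint described later:

```python
from collections import defaultdict

def combine_like_terms(add_node):
    """Group the children of an addition node by a coefficient-insensitive feature key and
    merge each group, e.g. 3a + a -> 4a and 7 + 4 + 2 -> 13."""
    assert add_node.op == "add"
    groups = defaultdict(list)
    for child in add_node.children:
        feature = ("const",) if child.value is not None else ("var", child.ssa_name, child.op)
        groups[feature].append(child)
    new_children = []
    for feature, nodes in groups.items():
        if feature[0] == "const":
            total = sum(n.value * n.coefficient for n in nodes)
            new_children.append(AstNode(value=total))
        else:
            rep = nodes[0]
            rep.coefficient = sum(n.coefficient for n in nodes)
            if rep.coefficient != 0:         # a zero coefficient eliminates the branch entirely
                new_children.append(rep)
    add_node.children = new_children
    return add_node
```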
  • the abstract syntax tree on the left of FIG. 19 shows that the nodes representing the "x" and "y" variables are connected by the addition operation of the root node.
  • the coefficient of the "y" node is 0, it means that the branch is completely eliminated after simplification, so the abstract syntax tree shown in the right part of Figure 13 can be obtained.
  • The abstract syntax tree in the left part of FIG. 20 shows that the leaf node representing the "c" variable is connected to the root node through a child node representing a multiplication operation. Since the leaf node representing the "c" variable is the only child of the child node representing the multiplication operation, this single branch can be eliminated to obtain the abstract syntax tree shown in the right part of Figure 20.
  • The conversion operations performed on the abstract syntax trees in the present disclosure have been described above with reference to FIGS. 13-20.
  • a comparison and judgment can be made based on the converted first abstract syntax tree and the second abstract syntax tree, so as to determine whether the first dimension and the second dimension can be folded based on whether they are the same or not.
  • the present disclosure will describe how to determine that two abstract syntax trees are the same abstract syntax tree with reference to FIG. 4 .
  • The upper part of FIG. 4 is the first abstract syntax tree in the context of the present disclosure, which may be constructed during the calculation process of tracing back along the intermediate representation of the neural network model to obtain the span of the first dimension (such as the aforementioned W dimension).
  • Correspondingly, the lower part of FIG. 4 is the second abstract syntax tree in the context of the present disclosure, which may be constructed during the calculation process of tracing back along the intermediate representation of the neural network model to obtain the span and size of the second dimension (such as the aforementioned C dimension).
  • The first abstract syntax tree can represent the expression "(a×b)+(c×d)+(e×f×g)+(-h)" (which, for example, is equivalent to W_stride), while the second abstract syntax tree can represent the expression "(-h)+(b×a)+(c×d)+(e×f×g)" (which, for example, is equivalent to "C_stride × C_size").
  • The present disclosure proposes to use a fingerprint-based hash operation to determine whether all child nodes under the first root node and all child nodes under the second root node are equivalent to each other.
  • hash operations based on fingerprints may be performed on the variables represented by all child nodes under the first root node and the second root node respectively.
  • the operation results of the hash operation may be sorted.
  • child nodes with the same operation result under the first root node and the second root node after sorting can be judged to be equivalent to each other.
  • the hash value of a node in the abstract syntax tree is equal to the sum of the products of the hash values of all child nodes connected to it and their respective coefficients.
  • For a child node that is a constant, its hash value is the constant itself; for a leaf node, its hash value is the pointer value of the operand it stores.
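Following these rules, a sketch of the fingerprint computation and of the child-equivalence check, on the hypothetical AstNode structure above (the Python hash of an SSA name stands in for the operand's pointer value):

```python
def fingerprint(node):
    """Constants hash to their own value, leaves hash to the identity of the operand they hold,
    and an internal node hashes to the sum of (child hash * child coefficient)."""
    if node.value is not None:                    # constant node
        return node.value
    if not node.children:                         # leaf node holding an operand / SSA value
        return hash(node.ssa_name)                # stand-in for the operand's pointer value
    return sum(fingerprint(c) * c.coefficient for c in node.children)

def children_equivalent(root_a, root_b):
    """Sort the child fingerprints under both roots and compare them pairwise."""
    fa = sorted(fingerprint(c) * c.coefficient for c in root_a.children)
    fb = sorted(fingerprint(c) * c.coefficient for c in root_b.children)
    return fa == fb
```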
  • the above-mentioned embodiments describe the dimensionality folding process of a single tensor data.
  • the tensor data may be data involved in a certain operation.
  • According to the folding scheme of the present disclosure, when the data participating in an operation ("op") involves two or more multi-dimensional tensor data, each of which has corresponding foldable dimensions, then for the correct execution of the operation, the two or more multi-dimensional tensor data are folded in the same way.
  • The operations include but are not limited to arithmetic operations (such as the four arithmetic operations of addition, subtraction, multiplication and division, convolution operations, activation operations, fully-connected operations, etc.) and memory access operations (such as copy operations). Whether each multi-dimensional tensor data in the same operation can be dimension-folded can be determined with reference to the above-mentioned embodiments.
  • For example, suppose the operation is an addition operation (that is, the add operator), and the addition operation has two operands.
  • The two operands are two tensor data A and B.
  • Both tensor data A and B have the HWC data format.
  • When the A tensor data is folded into the shape H*(WC), the B tensor data must also be folded into the shape H*(WC) in order to perform the addition operation.
  • Otherwise, the input data of the addition operation changes (in this example, the data read changes due to the shape change of the A tensor data), and as a result the calculation cannot be performed or errors occur.
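A hedged sketch of this constraint with made-up shapes and helper names: the operands of one op are folded with the smallest fold index that every operand supports, so that all of them end up with the same folded shape:

```python
def common_fold_index(fold_indexes):
    """Operands of one op must be folded the same way: take the smallest number of
    innermost dimensions that every operand can fold."""
    return min(fold_indexes)

def folded_shape(shape, fold_index):
    """Collapse the innermost `fold_index` dimensions into one."""
    if fold_index <= 1:
        return shape
    inner = 1
    for d in shape[-fold_index:]:
        inner *= d
    return shape[:-fold_index] + (inner,)

# Hypothetical HWC shape (4, 3, 2) for both A and B: if A can fold (W, C) but B could fold
# (H, W, C), the add must use the common folding, i.e. fold only (W, C) on both operands.
shape = (4, 3, 2)
idx = common_fold_index([2, 3])
assert folded_shape(shape, idx) == (4, 6)
```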
  • an operation instruction corresponding to the operation can be generated.
  • the operation instruction can be an operation instruction that the artificial intelligence processor can directly execute.
  • The tensor data can be read and calculated at the granularity of the folded tensor data, thereby improving the processing efficiency of this tensor data.
  • the source data and the target data obtained by the copy operation also need to have the same folding method.
  • the copy operation is used to copy the A tensor data from the off-chip storage space DDR to the on-chip storage space.
  • the A tensor data is the source data stored on the DDR and has a data shape of 4*3*2 (corresponding to the HWC dimension).
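The same constraint can be checked for copies before emitting the folded transfer; this hypothetical sketch uses the 4*3*2 shape from the example above and made-up helper names:

```python
def fold_copy_plan(shape, src_fold_index, dst_fold_index):
    """Source and destination of a copy must be folded the same way; use the common fold
    and return (number_of_transfers, elements_per_transfer)."""
    fold = min(src_fold_index, dst_fold_index)
    inner = 1
    for d in shape[len(shape) - fold:]:
        inner *= d
    outer = 1
    for d in shape[:len(shape) - fold]:
        outer *= d
    return outer, inner

# HWC source of shape 4*3*2 on DDR: if both the DDR view and the on-chip view can fold
# W and C, the copy becomes 4 transfers of 6 contiguous elements instead of 24 scalar ones.
assert fold_copy_plan((4, 3, 2), 2, 2) == (4, 6)
```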
  • FIG. 6 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It can be understood that the structure and composition shown in FIG. 6 is only an example, which is not intended to limit the solution of the present disclosure in any way.
  • The board 10 includes a chip 101, which may be a system-on-chip (System on Chip, SoC), that is, the system-on-chip described in the context of the present disclosure. In one implementation scenario, the chip may integrate one or more combined processing devices.
  • The aforementioned combined processing device can be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligent applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • The board 10 of this embodiment is suitable for cloud intelligent applications and has huge off-chip storage, huge on-chip storage, and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102 .
  • the external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a WI-FI interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102 .
  • the calculation result of the chip 101 can be sent back to the external device 103 via the external interface device 102 .
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 may also include a storage device 104 for storing data, which includes one or more storage units 105 .
  • The storage device 104 is connected to the control device 106 and the chip 101 through a bus and transmits data with them.
  • the control device 106 in the board 10 may be configured to regulate the state of the chip 101 .
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 7 is a configuration diagram showing a combination processing device in the chip 101 according to the above-described embodiment.
  • The combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a DRAM (Dynamic Random Access Memory) 204.
  • the computing device 201 can be configured to perform user-specified operations, and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it can be used to perform deep learning or machine learning calculations, and can also interact with the processing device 203 through the interface device 202 to jointly complete operations specified by the user. In one implementation scenario, a computing device herein may be configured to perform a convolution operation or a matrix multiplication operation in the context of the present disclosure.
  • the interface device 202 can be used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • The computing device 201 can obtain input data from the processing device 203 via the interface device 202 (such as various types of data related to neural network operations in the context of the present disclosure, including folded tensor data) and write it into the on-chip storage device of the computing device 201.
  • the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the chip of the computing device 201 .
  • the interface device 202 may also read data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201 .
  • the processing device 203 may be one or more types of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) or other general and/or special purpose processors.
  • Processors including but not limited to Digital Signal Processor (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof can be determined according to actual needs.
  • the computing device 201 of the present disclosure can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the processing device 203 in the heterogeneous multi-core structure can compile the instruction code embodying the neural network model to form a binary instruction sequence executable by the computing device 201 .
  • the DRAM 204 is used to store data to be processed, and in an implementation scenario, it can be a DDR memory, and its size can generally be 16G or larger, and is used to save data of the computing device 201 and/or the processing device 203.
  • FIG. 8 shows a schematic diagram of the internal structure of the computing device 201 as a single core.
  • The single-core computing device 301 is used to process input data for tasks such as computer vision, speech, natural language, and data mining.
  • the single-core computing device 301 includes three modules: a control module 31 , an operation module 32 and a storage module 33 .
  • the control module 31 is used to coordinate and control the work of the operation module 32 and the storage module 33 to complete the task of deep learning, which includes an instruction fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decoding unit (Instruction Decode Unit, IDU) 312.
  • the instruction fetching unit 311 is used to obtain instructions from the processing device 203 , and the instruction decoding unit 312 decodes the obtained instructions and sends the decoding results to the computing module 32 and the storage module 33 as control information.
  • the instruction here may be a general convolution instruction for performing matrix multiplication or convolution operation.
  • the operation module 32 includes a vector operation unit 321 and a matrix operation unit 322 .
  • the vector operation unit 321 can be used to perform vector operations, and can support relatively complex operations such as vector multiplication, addition, and nonlinear transformation.
  • the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, that is, the matrix multiplication and convolution operations mentioned in the context of this disclosure.
  • the storage module 33 can be used to store or transport related data, including a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333.
  • NRAM 331 is used to store input neurons, output neurons, and intermediate results after calculation;
  • WRAM 332 is used to store convolution kernels of deep learning networks, that is, weights;
  • The DMA 333 is connected to the DRAM 204 through the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
  • FIG. 9 shows a schematic diagram of the internal structure of the computing device 201 as a multi-core.
  • the multi-core computing device 41 may adopt a layered structure design and may operate as a system-on-chip, which may include at least one cluster (cluster), and each cluster includes multiple processor cores.
  • In other words, the multi-core computing device 41 is organized in a hierarchy of system-on-chip, cluster, and processor core.
  • the multi-core computing device 41 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and multiple clusters 405 .
  • the peripheral communication module 402 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to execute tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals among the modules.
  • the synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • the plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41 . Although 4 clusters are exemplarily shown in FIG. 9 , with the development of hardware, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405 . In an application scenario, the cluster 405 can be used to efficiently execute deep learning algorithms.
  • each cluster 405 may include multiple processor cores (IPU core) 406 and one storage core (MEM core) 407.
  • the number of processor cores 406 is exemplarily shown in the figure as four, and the present disclosure does not limit the number of processor cores 406, and its internal architecture is shown in FIG. 10 .
  • Each processor core 406 is similar to the single-core computing device 301 in FIG. 8 , and may also include three modules: a control module 51 , an operation module 52 and a storage module 53 .
  • the functions and structures of the control module 51 , computing module 52 and storage module 53 are roughly the same as those of the control module 31 , computing module 32 and storage module 33 , and will not be repeated here.
  • the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a moving direct memory access module (Move Direct Memory Access, MVDMA) 534.
  • IODMA 533 controls memory access of NRAM 531/WRAM 532 and DRAM 204 through broadcast bus 409;
  • MVDMA 534 is used to control memory access of NRAM 531/WRAM 532 and storage unit (SRAM) 408.
  • The storage core 407 is mainly used for storage and communication, that is, storing data shared between the processor cores 406 or intermediate results, and carrying out communication between the cluster 405 and the DRAM 204, communication between the clusters 405, communication between the processor cores 406, and so on.
  • the storage core 407 may have a scalar operation capability to perform scalar operations.
  • the storage core 407 may include a static random access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410 and a global direct memory access module (Global Direct Memory Access , GDMA) 411.
  • the SRAM 408 can assume the role of a high-performance data transfer station.
  • In other words, data reused by different processor cores 406 in the same cluster 405 does not need to be obtained from the DRAM 204 by each processor core 406 separately, but is transferred between the processor cores 406 through the SRAM 408.
  • The storage core 407 only needs to quickly distribute the reused data from the SRAM 408 to multiple processor cores 406, thereby improving the efficiency of inter-core communication and significantly reducing on-chip and off-chip input/output accesses.
  • the broadcast bus 409, the CDMA 410 and the GDMA 411 are respectively used to perform communication between the processor cores 406, communication between the clusters 405, and data transmission between the clusters 405 and the DRAM 204. They will be described separately below.
  • the broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405 .
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (for example, a single processor core to a single processor core) data transmission
  • Multicast is a communication method that transmits a piece of data from the SRAM 408 to several specific processor cores 406, while broadcast is the communication method that transmits a piece of data from the SRAM 408 to all processor cores 406; broadcast is a special case of multicast.
  • the CDMA 410 is used to control the memory access of the SRAM 408 between different clusters 405 in the same computing device 201.
  • the GDMA 411 cooperates with the external memory controller 401 to control memory access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 to the SRAM 408.
  • the communication between the DRAM 204 and the NRAM 531 or WRAM 532 can be realized in two ways.
  • The first way is to communicate directly between the DRAM 204 and the NRAM 531 or WRAM 532 through the IODMA 533; the second way is to first transfer the data between the DRAM 204 and the SRAM 408 through the GDMA 411, and then transfer the data between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534.
  • Although the second way may require more components to participate and has a longer data path, in some embodiments the bandwidth of the second way is much larger than that of the first way, so it may be more efficient to implement the communication between the DRAM 204 and the NRAM 531 or WRAM 532 in the second way. It can be understood that the data transmission methods described here are only exemplary, and those skilled in the art can, according to the teaching of the present disclosure, flexibly select and apply various data transmission methods depending on the specific arrangement of the hardware.
  • the function of GDMA 411 and the function of IODMA 533 can be integrated in the same component.
  • Although the present disclosure regards the GDMA 411 and the IODMA 533 as different components for the convenience of description, for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, such implementations belong to the protection scope of the present disclosure.
  • the function of GDMA 411, the function of IODMA 533, the function of CDMA 410, and the function of MVDMA 534 can also be realized by the same part.
  • FIG. 11 shows a design diagram of the hardware and software architecture of data flow programming in an embodiment of the present disclosure.
  • the hardware and software architecture in this embodiment may include an artificial intelligence ("AI") processor 601, a driver and an operating system 602, a compiler and a compiled language 603, a library 604, a framework layer 605 and Application layer 606 .
  • the AI processor 601 considers both calculation optimization and data handling optimization in hardware design. To this end, it uses a customized computing unit to accelerate computing, and uses on-chip storage to accelerate data handling, thereby achieving extremely high performance and energy efficiency.
  • the AI processor 601 may have a customized operation unit and an instruction set, where the instruction set may provide operation instructions (scalars, vectors and/or matrices) of different granularities.
  • the AI processor of the present disclosure can achieve a speed several tens of times higher than that of a mainstream GPU (Graphics Processing Unit).
  • the driver and operating system 602 is mainly responsible for scheduling tasks on the AI processor 601 .
  • the scheduling operation may involve allocating and releasing device memory, scheduling according to task priority, communication and synchronization between multiple devices, and the like.
  • For a compiled program, the operating system and the driver implement the scheduled execution of the task on a specific processor, including but not limited to the following operations: allocating and releasing device memory, transferring data between devices, maintaining task queues, and scheduling tasks according to priority, so as to achieve synchronization and collaboration among multiple devices.
  • the compiler and compiled language 603 may be a set of assembly language developed for the instruction set of the AI processor 601 . In the application, it can translate the deep learning operator developed for the AI processor 601 into a combination of processor instructions, so as to call the AI processor 601, so as to use the AI processor 601 efficiently.
  • Libraries 604 may include runtime libraries 614 and machine learning libraries 624 .
  • the aforementioned library 604 may use the instruction set of the AI processor 601 and perform partial optimization according to the instruction set of the AI processor 601, so as to increase the operating speed of the operator.
  • the runtime library 614 may be a set of high-performance operator libraries specially developed for the AI processor 601, and it may be used to complete the interaction between the general-purpose processor and the artificial intelligence processor. Further, the runtime library 614 can also provide a set of interfaces for artificial intelligence processors.
  • The machine learning library 624 can be used to accelerate various machine learning or deep learning algorithms on artificial intelligence processors.
  • the machine learning library 624 can provide a set of efficient, general, flexible and extensible programming interfaces, and its upper layer machine learning applications can directly adopt programming interfaces of various programming frameworks (such as TensorFlow, Caffe, MXNet, etc.), It is also possible to use the interface provided by the machine learning library 624 for direct programming.
  • the machine learning library 624 of the present disclosure can be conveniently called by the hardware platform, and the runtime library 614 can implement some basic common operators, such as various operations such as convolution and pooling.
  • The framework layer 605 can add encapsulation on top of the operators developed for the AI processor, mainly the encapsulation of the operators of the runtime library 614.
  • the framework layer 605 can also modify related tasks such as task scheduling or memory management.
  • the application layer 606 can be an application platform provided by developers of deep learning algorithms, and based on the native framework layer 605, the support for invoking the AI processor 601 is expanded during model runtime.
  • the framework layer 605 can realize the encapsulation and support of the operators in the high-performance operator library in the runtime library 614, and it mainly uses the data flow graph to construct the calculation process of the deep learning model according to the graph optimization mechanism.
  • The hardware architecture, the software architecture, their combination, and their internal structures of the present disclosure are described in detail above with reference to FIGS. 6-11. It can be understood that the above description is only exemplary and non-restrictive, and according to different application scenarios and hardware specifications, those skilled in the art can also make changes to the aforementioned board and internal structures of the present disclosure, and these changes still fall within the protection scope of the present disclosure. Based on the aforementioned hardware and software architecture, the dimension processing solution proposed in the present disclosure is described below in conjunction with FIGS. 12-20. By utilizing the dimension processing scheme of the present disclosure, the processing efficiency of multi-dimensional tensor data can be improved; in particular, the operating efficiency of the aforementioned software and hardware architectures of the present disclosure when performing data copying can be improved.
  • the electronic equipment or integrated circuit device disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, Internet of Things terminals , mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, cameras, video cameras, projectors, watches, earphones, mobile storage, wearable devices, visual terminals, automatic driving terminals, vehicles, household appliances, and/or or medical equipment.
  • Said vehicles include airplanes, ships and/or automobiles; said household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; said medical equipment includes nuclear magnetic resonance instruments, ultrasound scanners, and/or electrocardiographs.
  • the electronic equipment or integrated circuit device disclosed in the present disclosure can also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical treatment. Further, the electronic device or integrated circuit device disclosed in the present disclosure can also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as cloud, edge, and terminal.
  • According to the disclosed solutions, electronic devices or apparatuses with high computing power can be applied to cloud devices (such as cloud servers), while electronic devices or integrated circuit devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that according to the hardware information of the terminal device and/or the edge device, the hardware resources of the cloud device can be Match appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-end integration.
  • the present disclosure expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present disclosure is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present disclosure, those skilled in the art may understand that certain steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present disclosure can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for the realization of one or some solutions of the present disclosure. In addition, according to different schemes, the description of some embodiments in this disclosure also has different emphases. In view of this, those skilled in the art may understand the part that is not described in detail in a certain embodiment of the present disclosure, and may also refer to related descriptions of other embodiments.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit.
  • the aforementioned components or units may be located at the same location or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present disclosure.
  • multiple units in the embodiments of the present disclosure may be integrated into one unit, or each unit exists physically independently.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer readable memory. Based on this, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, and it can include several instructions to make a computer device (such as a personal computer, a server, or Network devices, etc.) execute some or all of the steps of the methods described in the embodiments of the present disclosure.
  • a computer device such as a personal computer, a server, or Network devices, etc.
  • The aforementioned memory may include but is not limited to a USB flash drive, a flash disk, a read-only memory ("Read Only Memory", abbreviated as ROM), a random access memory ("Random Access Memory", abbreviated as RAM), a mobile hard disk, a magnetic disk, an optical disc, or various other media that can store program code.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as CPU, GPU, FPGA, DSP, and ASIC.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which can be, for example, a variable resistance memory ("Resistive Random Access Memory”, abbreviated as RRAM), dynamic random access memory (“Dynamic Random Access Memory”, abbreviated as DRAM), static random access memory (“Static Random Access Memory”, abbreviated as SRAM), enhanced dynamic random access memory (“Enhanced Dynamic Random Access Memory”, abbreviated as "EDRAM”), high bandwidth memory (“High Bandwidth Memory”, abbreviated as "HBM”), hybrid memory cube ("Hybrid Memory Cube”, abbreviated as "HMC”), ROM and RAM, etc.
  • a compilation optimization method comprising:
  • the dimension information includes a dimension value of at least one dimension of the tensor data to be processed
  • the dimension collapsing information includes collapsible dimensions
  • the determining the dimension folding information of the tensor data to be processed according to the dimension information of the tensor data to be processed includes:
  • according to the dimension information of the tensor data to be processed, determining whether adjacent dimensions of the tensor data to be processed are continuous;
  • if the adjacent dimensions of the tensor data to be processed are continuous, determining that the adjacent dimensions are collapsible dimensions.
  • Clause A3 According to the method described in Clause A2 or A1, the adjacent dimensions of the tensor data to be processed include a first dimension and a second dimension, the method further comprising:
  • if the step size between the data in the first dimension direction is equal to the product of the step size between the data in the second dimension direction and the size of the data in the second dimension direction, determining that the adjacent dimensions of the tensor data to be processed are continuous.
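A minimal sketch of this contiguity test (the helper name is hypothetical and a dense layout is assumed in the example), generalized to count how many innermost dimensions can be folded:

```python
def max_fold_index(shape, strides):
    """Walk from the innermost dimension upward and count how many adjacent dimensions
    are continuous, i.e. satisfy stride[i] == stride[i+1] * size[i+1]."""
    fold = 1
    for i in range(len(shape) - 2, -1, -1):        # from second-innermost up to outermost
        if strides[i] == strides[i + 1] * shape[i + 1]:
            fold += 1
        else:
            break
    return fold

# Dense HWC tensor of shape (4, 4, 4): W_stride = 4 == C_stride * C_size = 1 * 4,
# and H_stride = 16 == W_stride * W_size, so all three dimensions can be folded.
assert max_fold_index((4, 4, 4), (16, 4, 1)) == 3
```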
  • The dimension folding information further includes a dimension folding index; and determining the dimension folding information of the tensor data to be processed according to the dimension information of the tensor data to be processed further includes:
  • Clause A5 According to the method described in any one of clauses A1-A4, performing dimensionality reduction processing on the tensor data to be processed according to the dimension folding information of the tensor data to be processed includes:
  • the folded tensor dimension is determined according to the dimension folding information of the tensor data to be processed, so as to realize the dimension reduction processing of the tensor data to be processed;
  • the folded tensor dimension is equal to the product of the dimension values of the foldable dimensions of the tensor data to be processed.
  • Clause A6 The method of any one of Clauses A1-A5, further comprising:
  • Clause A7 According to the method described in any one of Clauses A1-A6, the tensor data to be processed points to an operand in an operation; the method further includes:
  • Determining target dimension folding information according to dimension folding information corresponding to at least one operand in the operation, where the target dimension folding information is used to indicate a dimension folding manner of the at least one operand in the operation.
  • The target dimension folding information includes a target collapsible dimension and a target dimension folding index; and determining the target dimension folding information according to the dimension folding information corresponding to at least one operand in the operation includes:
  • the target dimension fold index is determined based on the target collapsible dimension and a dimension fold index of the at least one operand.
  • Clause A9 The method of any one of clauses A1-A8, further comprising:
  • Clause A10 The method of any one of clauses A1-A9, further comprising:
  • Clause A11 The method of any one of Clauses A1-A10, further comprising:
  • According to the target folding information, performing dimension expansion on the result data, so that the expression form of the result data is consistent with the initial expression form of the operand.
  • a computer device comprising:
  • the processor is configured to execute the computer-executable program to implement the method described in any one of clauses A1-A11.
  • Clause A13. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method described in any one of Clauses A1-A11 is implemented.
  • a method for processing multidimensional tensor data comprising:
  • Clause B2 The method of clause B1, further comprising:
  • Whether the first expression is equal to the second expression is judged by judging whether the first abstract syntax tree is the same as the second abstract syntax tree.
  • Clause B3. The method of Clause B2, wherein the first expression is based on a span of a first dimension of the multidimensional tensor data in the neural network model, and the second expression is based on a span and a size of a second dimension of the multidimensional tensor data in the neural network model.
  • Clause B4 The method of Clause B3, wherein obtaining the first abstract syntax tree comprises:
  • Clause B5. The method of Clause B3, wherein obtaining the second abstract syntax tree comprises:
  • Clause B6. The method described in Clause B4 or B5, further comprising stopping said tracing back in response to one of the following occurring during the tracing back:
  • Clause B7 The method according to clause B4 or B5, wherein before determining whether the first abstract syntax tree and the second abstract syntax tree are identical, the method further comprises:
  • Clause B8 The method of Clause B7, wherein transforming the first abstract syntax tree or the second abstract syntax tree comprises:
  • Clause B9 The method of Clause B7, wherein transforming the first abstract syntax tree or the second abstract syntax tree comprises:
  • the root node and child nodes are merged.
  • Clause B10 The method of Clause B7, wherein transforming the first abstract syntax tree or the second abstract syntax tree further comprises performing a reduce operation on the first abstract syntax tree or the second abstract syntax tree.
  • Clause B11 The method of Clause B10, wherein the reduction operation comprises performing one of the following reduction operations on the first abstract syntax tree or the second abstract syntax tree:
  • Clause B12 The method of clause B11, wherein in performing the merger, the method comprises:
  • Clause B13 The method of Clause B7, wherein converting the first abstract syntax tree or the second abstract syntax tree further comprises:
  • a single branch with only one node in the first abstract syntax tree or the second abstract syntax tree is eliminated.
  • Clause B14. The method of any one of Clauses B8-B13, wherein the first abstract syntax tree or the second abstract syntax tree is a hierarchical structure of nodes, and wherein determining whether the first abstract syntax tree and the second abstract syntax tree are the same includes, for the converted first abstract syntax tree and/or second abstract syntax tree:
  • Clause B15 The method of Clause B14, wherein determining whether the first abstract syntax tree and the second abstract syntax tree are identical comprises:
  • Clause B16 The method of Clause B15, wherein determining whether all child nodes under the first root node and all child nodes under the second root node are equivalent to each other comprises:
  • the child nodes with the same operation result under the first root node and the second root node after sorting are judged to be equivalent to each other.
  • Clause B17 The method of clause B14, further comprising:
  • An apparatus for processing multidimensional tensor data comprising:
  • a memory storing program instructions for processing multidimensional tensor data, which when executed by the processor implements the method according to any one of clauses B1-B17.
  • Clause B19 A computer-readable storage medium storing program instructions for processing multidimensional tensor data, which, when executed by a processor, implements the method of any one of clauses B1-B17.
  • An integrated circuit device comprising:
  • a memory storing binary program instructions obtained by compiling program instructions associated with a neural network model, wherein said neural network model is optimized by a method according to any one of clauses B1-B17,
  • wherein a computational task associated with the neural network model is performed when the binary program instructions are executed by the processor.
  • Clause B21 A board comprising the integrated circuit device according to Clause B20.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a compilation optimization method, a computer device, and a storage medium. The method can determine dimension folding information of a tensor and perform dimension folding on tensor data according to the dimension folding information, thereby optimizing read and write accesses to the tensor data and, in turn, optimizing the data processing of the tensor data by a target processor at the runtime stage. At the same time, the compilation optimization method of the present disclosure can reduce the number of loops in a program, reduce the control flow in the program, and improve the execution efficiency of the program.

Description

编译优化方法、装置、计算机设备以及存储介质
相关申请的交叉引用
本公开要求如下申请的优先权:于2021年9月3日申请的、申请号为2021110332978、发明名称为“编译优化方法、装置、计算机设备以及存储介质”的中国专利申请;于2021年9月3日申请的、申请号为2021110338762、发明名称为“一种用于处理多维张量数据的方法及其相关产品”的中国专利申请。
技术领域
本公开涉及计算机技术领域,特别是涉及一种编译优化方法、装置、计算机设备以及存储介质。
背景技术
在任务(例如,图像处理任务、视频处理任务、文本处理任务和语音处理任务等)处理过程中,通常涉及对任务中的数据进行处理,这些数据可能是数值和张量,其中,张量可以包括一维张量(即向量)、二维张量(即矩阵)以及更高维度的张量)等。在数据较复杂的情况下,多维张量的处理通常可通过多维循环的方式来处理,但多维循环的处理方式引入了更多控制流,增大了处理资源的占用,且并不能充分利用处理器和存储器的带宽,导致处理效率较低。
当前,“张量数据”被广泛应用于各类应用场景中,特别是包括深度机器学习的人工智能领域的计算场景中。具体来说,张量可以用于表达一个变化量或常量,其可以具有零个或多个维度。例如,零维张量是标量(如数值),即常量;一维张量是数值和方向组合的向量;二维张量是向量的组合(即矩阵);三维张量是数据立方体(即矩阵的组合);而四维张量可以是数据立方体的组合,以此类推。对于多维(例如大于或等于二维)张量数据来说,其维度值可以是变量,即该张量数据可以是变化量,其维度设置和基于维度设置的数据“摆放”将对数据的存储方式和读取方式产生重要的影响,并且由此显著影响计算效率和计算平台的整体性能。鉴于此,如何对多维张量数据进行维度处理以实现数据的高效读取成为亟需解决的技术问题。
发明内容
基于此,有必要针对上述技术问题,提供一种能够提升运算效率的编译优化方法、装置、计算机设备和存储介质。
根据本公开的第一方面,提供了一种编译优化方法,包括:获取待处理张量数据的维度信息,其中,所述维度信息包含所述待处理张量数据的至少一个维度的维度值;根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息;根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行维度折叠。
在第二方面中,本公开提供了一种用于处理多维张量数据的方法,该方法由处理器执行,并且包括获取与所述多维张量数据中的第一维度关联的第一表达式;获取与所述多维张量数据中的第二维度关联的第二表达式,其中所述第一维度和第二维度是多维张量数据的相邻维度;以及判断第一表达式与第二表达式是否相等,以便确定所述第一维度和所述第二维度是否能够进行维度折叠。
本公开第三方面还提供了一种计算机设备,包括:处理器;用于存储处理器计算机可执行程序的存储器;其中,所述处理器被配置为执行所述计算机可执行程序,以实现上述任意一项所述的方法。
本公开第四方面还提供了一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现上述任意一项所述的方法。
在第五方面中,本公开提供了一种集成电路装置,包括:处理器,其用于执行与神经网络模型关联的计算任务;以及存储器,其存储有对与神经网络模型关联的程序指令进行编译后所获得的二进制程序指令,其中所述神经网络模型经由根据上述以及下文多个实施例中所讨论的方法进行优化,其中当所述二进制程序指令由所述处理器运行时,执行与所述神经网络模型关联的所述计算任务。
本公开的编译优化方法,通过张量数据的维度信息确定张量的维度折叠信息(包括维度折叠方向以及可折叠维数等),并根据该张量维度折叠信息对张量数据进行维度折叠,从而可以在优化张量数据的读写访问,进而可以优化目标处理器在运行阶段对张量数据的数据处理过程。同时,本公开的编译优化方法,可以减少程序中的循环次数,降低程序中的控制流,提高程序的执行效率。
通过利用本公开上述多个方面中所公开的技术方案,可以对多维张量数据在维度方面进行处理,从而可以实现多维张量数据的维度折叠。通过这样的维度折叠,本公开的方案可以实现对多维张量数据的“降维”操作,从而可以方便在例如硬件平台的片上与片外之间的数据搬运或迁移,最大程度地发挥数据传输的带宽。由此,可以显著提升数据传输带宽的利用率,并且因而提升硬件平台运算时的整体性能。
附图说明
通过参考附图阅读下文的详细描述,本公开示例性实施方式的上述以及其他目的、特征和优点将变得易于理解。在附图中,以示例性而非限制性的方式示出了本公开的若干实施方式,并且相同或对应的标号表示相同或对应的部分,其中:
图1为本公开实施例中用于执行编译优化方法的处理器的示意图;
图2为本公开一个实施例的编译优化方法的流程图;
图3为本公开实施例的张量数据维度折叠的一个示意图;
图4为本公开实施例中通过语法树判断张量数据的维度是否连续的示意图;
图5为本公开另一实施例的编译优化方法的流程图;
图6是示出根据本公开实施例的板卡的结构图;
图7是示出根据本公开实施例的集成电路装置的结构图;
图8是示出根据本公开实施例的单核计算装置的内部结构示意图;
图9是示出根据本公开实施例的多核计算装置的内部结构示意图;
图10是示出根据本公开实施例的处理器核的内部结构示意图;
图11是示出根据本公开实施例的软硬件架构的结构图;
图12是示出根据本公开实施例的用于处理多维张量数据的方法的流程图;
图13是示出根据本公开方案转换抽象语法树的一个实施例的示意图;
图14是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图15是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图16是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图17是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图18是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图19是示出根据本公开方案转换抽象语法树的又一个实施例的示意图;
图20是示出根据本公开方案转换抽象语法树的又一个实施例的示意图。
具体实施方式
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述。显然,所描述的实施例是本公开一部分实施例,而不是全部的实施例,并且所描述的多个实施例可以根据场景进行适当的组合以实现不同的应用。基于本公开中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。
应当理解,本披露的权利要求、说明书及附图中使用的术语“第一”、“第二”和“第三”等是用于区别不同对象,而不是用于描述特定顺序。本披露的说明书和权利要求书中使用的术语“包括”和“包含”指示所描述特征、整体、步骤、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、步骤、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在此本披露说明书中所使用的术语仅仅是出于描述特定实施例的目的,而并不意在限定本披露。如在本披露说明书和权利要求书中所使用的那样,除非上下文清楚地指明其它情况,否则单数形式的“一”、“一个”及“该”意在包括复数形式。还应当进一步理解,在本披露说明书和权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本说明书和权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于检测到[所描述条件或事件]”。
本公开实施例的编译优化方法可应用于处理器中,该处理器一般是通用处理器,例如CPU(Central Processing Unit,中央处理器)。如图1所示,处理器1-100包括多个处理单元1-101以及存储单元1-102,多个处理单元1-101用于执行指令序列,存储单元1-102用于存储数据,可包括随机存储器(RAM,Random Access Memory)和寄存器堆。处理器1-100中的多个处理单元1-101既可共用部分存储空间,例如共用部分RAM存储空间和寄存器堆,又可同时拥有各自的存储空间。可选地,该处理器上可以运行一用于执行该编译优化方法的编译器,编译器可以是运行在处理器上的计算机程序。该编译器用于将接收到的源程序转换为目标处理器能够运行的目标代码,例如目标处理器能够直接运行的指令。可选地,本公开的编译器可以通用编译器,也可以是能够编译神经网络计算图的神经网络编译器,如TVM,Mindspore等等。
该目标处理器可以是上述的通用处理器,也可以是用于执行人工智能运算的人工智能处理器(IPU)。其中,人工智能运算可包括机器学习运算,类脑运算等。其中,机器学习运算包括神经网络运算、k-means运算、支持向量机运算等。该人工智能处理器可例如包括GPU(Graphics Processing Unit,图形处理单元)、NPU(Neural-Network  Processing Unit,神经网络处理单元)、DSP(Digital Signal Process,数字信号处理单元)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)芯片中的一种或组合。本公开对处理器的具体类型不作限制。在一种可能的实现方式中,本公开中所提及的目标处理器可包括多个处理单元,每个处理单元可以独立运行所分配到的各种任务,如卷积运算任务、池化任务或全连接任务等。本公开对处理单元及处理单元所运行的任务不作限制。
随着目标处理器的计算能力的提高,目标处理器能够处理的数据的粒度越来越大。例如,目标处理器为能够进行并行计算的处理器,该目标处理器可以对向量或矩阵等粒度的张量数据进行处理。具体地,在目标处理器进行卷积运算任务、池化任务或全连接等神经网络运算时,通常可以采用向量、矩阵甚至更大粒度的张量数据进行运算。例如,以三维特征图为例,该三维特征图的形状可以表示为HWC,其中,H表示张量数据的高度,W表示张量数据的宽度,而C表示张量数据的通道数。这里,H维度可以认为是最高维度,C维度可以认为是最低维度,W维度是处于最高维度和最低维度之间的中间维度。在其他实施例中,该张量数据还可以是四维张量数据,其可以表示为NHWC,其中N表示批处理数(Batch size),N可以认为是最高维度。传统技术中,在对多维张量进行处理的过程中,通常采用多维循环的方式。对于上述的三维张量数据可进行H×W×C次循环,每次处理一个维度方向的向量(即,读取某个维度方向上的多个数据)。但循环语句会在编译时引入更多控制流,从而导致运行阶段需要多次数据访存,从而带来了多次读写开销,导致目标处理器对于张量数据的处理效率下降。
针对上述存在的技术问题,本公开提供了一种编译优化方法,可以根据张量数据的维度信息确定张量的维度折叠信息(包括维度折叠方向以及可折叠维数等),并根据该张量维度折叠信息对张量数据进行维度折叠,以根据维度折叠后的张量数据生成操作指令(其中,该操作指令包括但不限于访存指令和运算指令),从而可以在优化张量数据的读写访问,进而可以优化目标处理器在运行阶段对张量数据的数据处理过程。如图2所示,所述编译优化方法包括:
S210、获取待处理张量数据的维度信息,其中,所述维度信息包含所述待处理张量数据的至少一个维度的维度值;
S220、根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息;
S230、根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行维度折叠。
本公开实施例中,待处理张量数据(即多维张量数据)可以是二维或二维以上的张量数据,并且维度的排列可以具有一定的顺序,在一个实施例中,本公开的待处理张量数据可以是三维或四维张量数据。每个维度方向上可以具有一维度值,表示该维度方向上的数据量(包括但不限于数据的个数,数据的位宽数等等)。以三维张量数据为例,该三维张量数据可以表示为HWC,其中,H维度方向上的数据量可以用m表示,W维度方向上的数据量可以用k表示,C维度方向上数据量可以用n表示,此时该三维张量数据的形状可以表示为m×k×n,其中,m,k,n分别表示该三维张量在三个维度方向的维度值,若m=4,k=4,n=4,则表示该三维张量数据在每个维度方向 上的维度值均为4。
可选地,待处理张量数据的维度信息可以是用户在编程过程中通过张量数据的定义确定的。其中,用户可以通过张量数据的定义方式直接将张量数据的某一维度值赋值为常量(如立即数),用户还可以通过变量声明的方式确定张量数据的维度值。进一步地,本公开实施例中,当张量数据的维度值为常量时,此时编译器可以将该张量数据的形状替换为常量表示的形式。承接上例,若三维张量数据的三个维度方向上的维度值分别为:m=4,k=4,n=4,则编译器可以将三维张量数据的形状表示为4×4×4。当张量数据的维度值包括常量和变量时,此时编译器可以将该张量数据中相应维度的维度值替换为常量表示的形式。例如,若三维张量数据的三个维度方向上的维度值分别为:m=4,k=4,n,则编译器可以将三维张量数据的形状表示为4×4×n,其中n为变量。
之后,本公开实施例的编译器可以根据该待处理张量数据的维度信息,确定该待处理张量数据的相邻维度是否连续,以确定该待处理张量数据的维度折叠信息。其中,待处理张量数据的维度折叠信息包括但不限于,维度折叠方向(如可折叠维度)以及可折叠维度的数量。
可选地,编译器可以确定张量数据的相邻维度是否是连续的,如果该张量数据的相邻维度是连续的,则可以确定该相邻的维度可以被折叠。由于该多维张量数据在存储器中以一维方式按照维度顺序进行存储,例如,多维张量数据以高维度到低维度的顺序依次存储,因此,多维张量数据的同一维度方向上的不同起始元素之间存在一定的步长/跨度(stride),该步长/跨度可以等于某一维度方向上的维度值。基于此,如果多维张量数据中第一维度方向上数据之间的步长等于第二维度方向上数据之间的步长与第二维度方向上数据的尺寸之间乘积,则可以确定相邻的维度是连续的,其中,第一维度高于第二维度。举例而言,假设张量数据的相邻两个维度分别为:W维度和C维度,并且满足W stride=C stride×C size,则可以任务W维度和C维度是连续、且可被折叠。其中W stride表示W维度的步长,C stride和C size分别表示C维度的步长和尺寸,其中,W维度高于C维度。
在一种可能的实现方式中,对于多维张量数据,可逐次确定相邻的两个相邻维度之间是否连续。例如,首先确定三维张量数据的三个维度从低到高依次为第三维度(如C维度)、第二维度(如W维度)和第一维度(如H维度)。本公开可以首先确定第二维度和第三维度是否连续,在第二维度和第三维度连续的情况下,再按照上述方式判断第一维度和第二维度是否连续。上述示例中是按照从低维到高维的顺序来顺次判断相邻维度是否连续,在其他示例中,也可以按照从高维到低维的顺序来判断相邻维度是否连续。本公开仅以示例的方式进行说明,并不做具体限定。
在确定待处理张量数据的维度折叠信息之后,编译器可以根据维度折叠信息对待处理张量数据进行维度折叠处理。例如,本公开实施例可对多维张量进行降维,以获得维度较低,但每个维度的维数较高的张量数据。继续沿用上述示例,本公开实施例可以将上述三维特征图降维处理为二维张量,例如,降维处理后的张量数据可以表示为(m×k)×n,其中该降维后的二维张量数据可以其中一个维度值为(m×k),另一个维度值为n。这样当根据该降维后的张量数据进行处理时,仅需要两层嵌套循环,而不需要三层嵌套循环,从而减少了循环次数,减少控制流,提升处理效率。同时, 当根据该降维后的张量数据进行处理时,可以循环n次,每次处理m×k大小的数据量,从而可以目标处理器对该张量数据的访存开销和运算开销等。再如,降维后的张量数据可以是一维数据,其维度值可以为(m×k×n),本公开仅以示例的方式说明,并不限定本公开的具体实现方案。
例如,如图3所示,以NHWC数据摆放格式的四维张量数据为例,其中N和H互为相邻维度,其中N维度为第一维度而H维度为第二维度,其中,第一维度高于第二维度。类似地,H维度和W维度互为相邻维度,其中H维度为第一维度而W维度为第二维度。同样地,W维度和C维度为相邻维度,其中W维度为第一维度而C维度为第二维度。进一步标示出C 0~C 4、W 0~W 4和H 0-H 4的张量数据(其示出为一个大的立方体,其中所包含的每个小立方体代表具有最小粒度的数据块)来说,当试图将中间维度的W维度和最低维度的C维度进行维度折叠时,可以通过等式W stride=C stride×C size来确定,如上文所描述的。由于图中张量数据的W stride=4,C stride=1并且C size=4,即满足前述等式,因此可以判断C维度的数据在W维度上是连续的,因此可以执行折叠操作。通过对张量数据执行W维度和C维度的折叠操作,可以令张量数据具有图3的右部301所示出的数据形状(也即“降维”后的形状),从而方便后续的数据读取,显著减小代码实现过程中的嵌套循环操作(下文将详述)。具体地,第1数据块可以对应于坐标H 0W 0C 0,第2数据块可以对应于坐标H 0W 0C 1,以此类推,第四数据块可对应于坐标H 0W 0C 4,即第1~第4数据块对应于张量数据中左上角沿C维度排列的四个数据块。接着,第5数据块对应于坐标H 0W 1C 0,第6数据块对应于坐标H 0W 1C 1,第7数据块对应于坐标H 0W 1C 2,第8数据块对应于坐标H 0W 1C 3。以此类推,直至得到如图中所示出的沿H 0、H 1、H 2、和H 3的H维度所排列的四行数据。可以看出,此时W和C维度已经折叠成同一维数据,以便于数据的拷贝操作和运算操作。根据本公开的折叠方案,仍可以判断是否对H维度和该新维度进行折叠操作。具体地,由于H维度的跨度为16,而所述新维度的跨度为1且尺寸为16,因此仍满足本公开所设定的等式要求,即H维度的数据在折叠后的“WC”维度上是连续的。鉴于此,接着对前述得到的二维矩阵进行维度折叠,从而可以得到如302处示出的一组数组形式(或形状)的张量数据,也即沿H维度排列的从数据1~数据64的数组。通过对H维度和“WC”维度的维度折叠使数据立方体以一维形式呈现,
可选地,上述待处理张量数据可以是常量数据,也可以是变量数据。当该待处理张量数据的各个维度值均为常量时,则可以认为该待处理张量数据为常量数据;当该待处理张量数据的至少一个维度值为变量,则可以认为该待处理张量数据为变量数据。其中,待处理张量数据的各个维度值可以是用户通过赋值等方式确定的。当待处理张量数据的维度值为常量时,则可以将该待处理张量数据的相应维度值替换为常量。
例如,用户可以通过赋值方式指定待处理张量数据的维度值为常量:
%c64=constant 64:index;
%c128=constant 128:index;
其中,%c64,%c128分别表示待处理张量数据的维度值。
通过上述变量声明语句可以确定待处理张量数据的至少一个维度值中包含有常量，此时，可以将待处理张量数据的相应维度值替换为常量。承接上例，在维度值替换之前，该待处理张量数据的表示形式为：
%c=std.view%1[0,0][%c64,%c128]:memref<32768xi8,101>to memref<?x?xi32,101>,其中的“?”表示变量。
在维度值替换之后,该待处理张量数据的表示形式可以为:
%c=std.view%i1[0,0][]:memref<32768xi8,101>to memref<64x128xi32,101>,从而可以看出,其中的维度值替换为相应的常量值。
可选地,当上述待处理张量数据为常量数据时,则可以根据该待处理张量数据的各个维度值确定其相邻维度是否连续。若相邻维度连续,则确定该相邻维度可以被折叠为一个维度。若相邻维度不连续,则确定相邻维度不能被折叠。进一步地,当待处理张量数据的相邻维度连续,且其相邻维度的维度值为常量时,则可以对该相邻维度进行折叠,并将该待处理张量数据以折叠后的张量维度进行表示。其中,该折叠后的张量维度值为连续的相邻维度的维度值的乘积。
例如,该待处理张量数据的表示形式的转换可以通过如下的折叠语句进行实现:
%c’=mlu.memrefcast%c memref<64x128xi32,101>to memref<8192xi32,101>,其中,%c’为折叠后的张量数据,该折叠后的张量数据的表示为8192xi32,折叠后的张量维度值8192=64×128。
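下面给出一个示意性的Python片段（仅为说明折叠后维度值等于各可折叠维度的维度值之乘积，并非本公开的实际实现，函数名为假设）：
def fold_constant_dims(dims):
    # 折叠后的张量维度值 = 各可折叠维度的维度值的乘积
    prod = 1
    for d in dims:
        prod *= d
    return prod

assert fold_constant_dims([64, 128]) == 8192   # 对应正文中 64x128 -> 8192 的折叠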
可选地，当上述待处理张量数据为变量数据时，由于待处理张量数据的至少一个维度值为变量（该变量在编译时无法确定，在编译过程中该变量可以通过一表达式进行表示，其具体数值需要在运行时才能确定），此时无法通过其维度值直接确定相邻维度是否连续，编译器可以通过该待处理张量数据的相邻维度相关参数的表达式来确定其相邻维度是否连续。可选地，上述相邻维度相关参数的表达式可以转换为抽象语法树，编译器可以依据静态单赋值（SSA）分别确定出高维度的步长W_stride以及低维度的C_stride×C_size的抽象语法树，并对这两个语法树进行化简和变换，使得这两个语法树具有相同的树形结构（例如这两个语法树的根节点均是加法操作，叶节点可以为乘法操作）。例如，对于上述语法树，本公开实施例可以将乘法运算对应的根节点或子节点向所述抽象语法树的下层方向移动；以及将加法运算对应的子节点和/或叶子节点向所述抽象语法树的上层方向移动，以获得具有相同树形结构的语法树。之后，编译器可以通过判断两个语法树是否相同来确定相邻维度是否连续，以确定相邻维度是否能够被折叠。
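为直观说明这一判断过程，下面给出一个高度简化的Python示意片段（并非本公开的实际实现；这里用嵌套元组近似表示语法树，normalize、equivalent等函数名均为假设），其将两个表达式规范化后再比较是否等价：
def normalize(node):
    if isinstance(node, (str, int)):
        return node
    op, *args = node
    args = [normalize(a) for a in args]
    if op == '*':
        # 将乘法向下层移动：对含加法的因子按分配律展开
        for i, a in enumerate(args):
            if isinstance(a, tuple) and a[0] == '+':
                rest = args[:i] + args[i + 1:]
                return normalize(('+', *[('*', t, *rest) for t in a[1:]]))
    # 合并运算类型相同的父子节点，并对子项排序以消除顺序差异
    flat = []
    for a in args:
        if isinstance(a, tuple) and a[0] == op:
            flat.extend(a[1:])
        else:
            flat.append(a)
    return (op, *sorted(flat, key=repr))

def equivalent(expr1, expr2):
    return normalize(expr1) == normalize(expr2)

# 示例：W_stride 的表达式 (a+b)*c 与 C_stride*C_size 的表达式 c*a + b*c 在规范化后等价
w_stride_expr = ('*', ('+', 'a', 'b'), 'c')
c_stride_size_expr = ('+', ('*', 'c', 'a'), ('*', 'b', 'c'))
assert equivalent(w_stride_expr, c_stride_size_expr)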
例如，图4示出了两个根据表达式构建、并经变换后获得的语法树。假定图4上部是本公开的第一抽象语法树，其可以是沿操作的中间表达向前追溯获取第一维度（例如前述的W维度）的步长的计算过程期间所构建的。相对应地，图4下部是本公开的第二抽象语法树，其可以是沿操作的中间表达向前追溯获取第二维度（例如前述的C维度）的步长和尺寸的计算过程期间所构建的。通过判断图4中的第一抽象语法树和第二抽象语法树相同，即可以推断出前述的等式W_stride=C_stride×C_size成立，从而判断出第一维度和第二维度是连续的，二者可以执行维度折叠操作以转换到一个维度上。
从图4中所示可以看出，第一抽象语法树可以代表“(a×b)+(c×d)+(e×f×g)+(-h)”的表达式（其例如等价于W_stride），而第二抽象语法树可以代表“(-h)+(b×a)+(c×d)+(e×f×g)”（其例如等价于“C_stride×C_size”）的表达式。由于第一抽象语法树和第二抽象语法树在第二层和第三层上还存在差异，在一个实施场景中，本公开实施例还可以利用基于指纹的哈希运算来确定第一根节点下的所有子节点和第二根节点下的所有子节点是否彼此等价。就如图4所示例子来说，当第一抽象语法树和第二抽象语法树中连接根节点并代表“(a×b)”、“(c×d)”、“(e×f×g)”和“-h”的四个子树的哈希值都相同时，则可以判断两棵抽象语法树相同。由此，可以确定与两个抽象语法树关联的第一维度（W维度）和第二维度（C维度）是可以折叠的。
进一步地,当待处理张量数据的多个维度能够被折叠时,编译器还可以根据该能够被折叠的维度,获得待处理张量数据的折叠维度索引,该折叠维度索引用于表示待处理张量数据的最大连续的维度的索引。从而编译器可以获得哪些维度可以被折叠,以及能够折叠的维度数量(维度折叠索引)等维度折叠信息,以根据该维度折叠信息对待处理张量数据进行维度折叠。
例如,待处理张量数据为HWC,其总维度数为3。如果确定W维度和C维度连续,则可以确定W维度和C维度能够被折叠,可确定该待处理张量数据的维度折叠索引值为2。若H维度、W维度和C维度均连续,则可以确定H维度、W维度和C维度均能够被折叠,此时该待处理张量数据的维度折叠索引值为3,该维度折叠索引值等于其总维度数。
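下面用一个示意性的Python片段说明维度折叠索引的一种可能计算方式（仅为在“从最低维度开始逐对判断相邻维度是否连续”这一假设下的草图，并非本公开的实际实现，函数名为假设）：
def fold_index(strides, sizes):
    """strides/sizes 按从高维到低维的顺序给出，例如 HWC。"""
    n = len(strides)
    idx = 1                                     # 这里将不可继续折叠时的索引记为1（单个维度），具体约定以实现为准
    for i in range(n - 1, 0, -1):               # 从最低的相邻维度对开始逐对判断
        if strides[i - 1] == strides[i] * sizes[i]:
            idx += 1
        else:
            break
    return idx

# HWC 张量，H_stride=16, W_stride=4, C_stride=1，各维度尺寸均为4
assert fold_index([16, 4, 1], [4, 4, 4]) == 3   # H、W、C 三个维度均可折叠
assert fold_index([32, 4, 1], [4, 4, 4]) == 2   # 仅 W、C 两个维度可折叠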
本公开中的待处理张量数据可以是参与运算操作和/或访存操作的数据块。可选地,在处理器执行任务的过程中,可能会处理多种类型的信息,例如,该任务所涉及的张量数据以及该任务所需执行的至少一个运算操作等。所述任务可包括图像处理任务、视频处理任务、文本处理任务和语音处理任务中的任意一种,本公开对预设任务的类型不做限制。当某一运算操作所需的操作数为两个以上时,例如四则运算操作需要两个操作数,此时,在对参与运算操作的待处理张量数据进行降维处理时,还需满足至少两个操作数具有相同的维度折叠信息,以满足运算的需求。其中,运算操作的每一个操作数可以是一个待处理张量数据。如图5所示,当操作中包含至少一个待处理张量数据时,该编译优化方法可以包括:
在步骤S510中,获取操作中至少一个待处理张量数据的维度信息,其中,所述维度信息包含所述张量数据的至少一个维度的维度值。
其中,上述操作包括运算操作和访存操作等,该运算操作包括但不限于乘法运算操作、加法运算操作、卷积运算操作、激活运算操作等运算操作。每个运算操作均可以对特定的张量数据进行运算,且每个运算操作所需的待处理张量数据至少为一个。该运算操作可以被编译成为对应的运算指令,目标处理器可以执行该运算指令以实现相应的运算操作。在神经网络等大数据等应用场景中,为提高数据效率,每个参与该运算操作的数据均可以是张量数据。其中,待处理张量数据可以是一维张量,即向量数据;该张量数据也可以是二维张量,即矩阵数据;该张量数据还可以是多维张量。
具体地,该张量数据可以包括至少一个维度的维度值,至少一个维度值用于描述该张量数据的形状及大小,至少一个维度值的乘积即为该张量数据的大小。例如,当张量数据为矩阵数据时,则该张量数据的维度值可以包括行维度值和列维度值,该张量数据可以表示为行维度值×列维度值,例如,64×128。当张量数据为三维数据时,该张量数据的维度值可以包括长度值、宽度值和高度值,该张量数据可以表示为长度值×宽度值×高度值,例如,64×128×32。
在步骤S520中,分别根据每个待处理张量数据的维度信息,确定各个所述待处理张量数据的维度折叠信息。
具体地,针对操作中的每个待处理张量数据,可以按照上述实施例的方式,分别确定每个待处理张量数据的相邻维度是否连续,以获得每个待处理张量数据的维度折叠信息。其中,每个待处理张量数据的维度折叠信息的确定方式具体可参见上文,此处不再赘述。
在步骤S530中,根据所述操作中所有待处理张量数据的维度折叠信息,确定目标维度折叠信息。其中,目标折叠信息可以包括操作中各个待处理张量数据的目标可折叠维度以及目标可折叠维度数量(即目标维度折叠索引)。
本公开实施例中,为了保证操作的正确执行,操作中涉及的所有待处理张量数据的折叠方式应当是相同的,即各个待处理张量数据需要的维度折叠方向一致且各个待处理张量数据的维度折叠索引值相同。基于此,上述目标可折叠维度信息的确定方法可以包括:
根据所述至少一个操作数的可折叠维度,确定目标可折叠维度,其中所述目标可折叠维度可以为所述至少一个操作数的可折叠维度的交集;以及
基于所述目标可折叠维度和所述至少一个操作数的维度折叠索引，确定所述目标维度折叠索引，其中该目标维度折叠索引可以是操作中所有待处理张量数据中在目标可折叠维度方向上的折叠维度索引的最小值。即本公开实施例中，不仅需要保证操作中至少一个操作数的折叠方向（目标可折叠维度）相同，还需要保证操作中目标可折叠维度的数量相同。
例如,上述的操作为加法操作,该加法操作涉及两个操作数(即加数和被加数),加数和被加数均为三维张量数据。其中,加数可以表示为H1W1C1,加数的维度折叠信息为:在H维度、W维度、C维度均可以被折叠,维度折叠索引值为3。而被加数可以表示为H2W2C2。被加数的维度折叠信息为:W维度、C维度可以被折叠,维度折叠索引值为2。则编译器可以确定目标折叠信息为:W维度、C维度可以被折叠,维度折叠索引值为2。再如,加数的维度折叠信息为:在H维度、W维度可以被折叠,维度折叠索引值为2。而被加数的维度折叠信息为:W维度、C维度可以被折叠,维度折叠索引值为2。则编译器可以确定目标折叠信息为:在任一维度均不可以被折叠,维度折叠索引值为0。
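下面的Python片段给出上述目标维度折叠信息确定逻辑的一个示意性草图（并非本公开的实际实现；其中“交集中少于两个维度即认为无法进行一致的折叠”是本示例采用的简化约定，函数名均为假设）：
def target_fold_info(operands):
    """operands: 每个元素为（可折叠维度集合, 维度折叠索引）。"""
    dims_sets = [set(dims) for dims, _ in operands]
    target_dims = set.intersection(*dims_sets)      # 目标可折叠维度取各操作数可折叠维度的交集
    if len(target_dims) < 2:
        return set(), 0                             # 无法进行一致的维度折叠
    return target_dims, min(idx for _, idx in operands)   # 目标维度折叠索引取最小值

# 对应正文示例一与示例二
assert target_fold_info([({'H', 'W', 'C'}, 3), ({'W', 'C'}, 2)]) == ({'W', 'C'}, 2)
assert target_fold_info([({'H', 'W'}, 2), ({'W', 'C'}, 2)]) == (set(), 0)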
在步骤S540中,根据所述目标维度折叠信息,对所述操作中的所有待处理张量数据进行维度折叠。
本公开实施例中，编译器可以根据目标维度折叠信息中表示的可折叠维度以及目标维度折叠索引，对操作中的所有待处理张量数据进行一致的维度折叠操作，从而保证操作的正确执行。同时，本公开实施例的编译优化方法，通过编译器在编译时可对张量数据的维度信息进行降维处理，可以获得维度更低，维数值更高的张量数据，减少了对张量数据循环处理的数量，减少了控制流，因而在运行时可充分利用处理器和存储器的带宽，一次性读取更多数据，提高了张量数据的处理效率。
可选地，上述方法还可以根据降维处理后的张量数据执行相应的操作。上述方法还可以包括：在步骤S550中，在对所述操作中的操作数进行维度折叠之后，编译所述操作，生成所述操作对应的指令。该指令可以是目标处理器能够执行的硬件指令，当目标处理器执行该指令时，能够实现上述操作的功能，包括但不限于运算功能和访存功能（如数据拷贝功能等）。
例如,该操作为加法操作,其可以包括三个操作数%a,%b和%c,该三个操作数均为张量数据,且三个操作数进行维度折叠后分别为%a’,%b’和%c’。
%a’=mlu.memrefcast%a memref<64x128xi32,101>to memref<8192xi32,101>;
%b’=mlu.memrefcast%b memref<64x128xi32,101>to memref<8192xi32,101>;
%c’=mlu.memrefcast%c memref<64x128xi32,101>to memref<8192xi32,101>;
此时,该加法操作可以表示如下:
mlu.add(%a’,%b’,%c’):(memref<8192xi32,101>,memref<8192xi32,101>,memref<8192xi32,101>);
可见,在加法操作中,每个张量数据的折叠后的维度值均为8192,每个张量数据均是按照8192的粒度进行运算和访存,从而目标处理器在执行基于此生成的硬件指令时,可以提高数据的访存效率和运算效率。
可选地，实现上述操作后所得结果数据的表达形式可以与折叠后的张量数据的表达形式一致。但该结果数据的维度与初始的待处理张量数据的维度可能不同，此时为了保证程序的前后运行可靠性和准确性，可以再将折叠后的张量数据的表示形式展开为初始表示形式。例如，承接上例，由于加法操作中各个操作数的表示形式均为8192，故加法运算获得的结果数据的表示形式也为8192，此时为了保证后续执行的正确性，可以将该加法运算操作的结果数据展开为64×128。
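作为一个示意性的说明（并非本公开的实际实现，这里借助numpy的reshape来类比折叠与展开，变量名均为假设）：
import numpy as np

a = np.arange(64 * 128, dtype=np.int32).reshape(64, 128)
b = np.ones((64, 128), dtype=np.int32)

a_folded = a.reshape(8192)          # 类比 mlu.memrefcast 的降维表示
b_folded = b.reshape(8192)
c_folded = a_folded + b_folded      # 以 8192 的粒度完成加法运算

c = c_folded.reshape(64, 128)       # 将结果数据展开回初始表示形式
assert c.shape == (64, 128)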
本公开还提供了一种计算机设备,该计算机设备可以包括处理器和存储器,存储器上存储有计算机可执行程序,当所述处理器执行上述计算机可执行程序时,实现上述任一实施例的方法。本公开实施例中,该处理器上可运行有一编译器,编译器可以实现上述任一实施例的方法。具体地,编译器用于:获取待处理张量数据的维度信息,其中,所述维度信息包含所述待处理张量数据的至少一个维度的维度值;根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息;根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行维度折叠。
在一个实施例中,维度折叠信息包括可折叠维度;编译器具体用于:根据所述待处理张量数据的维度信息,确定所述待处理张量数据的相邻维度是否连续;当所述待处理张量数据的相邻维度连续时,则确定所述相邻维度为可折叠维度。
在一个实施例中,所述待处理张量数据的相邻维度包括第一维度和第二维度,编译器还用于:当所述第一维度方向上的数据之间的步长等于第二维度方向上数据之间的步长和第二维度方向上数据的尺寸之积时,则确定所述待处理张量数据的相邻维度连续。
在一个实施例中,所述维度折叠信息还包括维度折叠索引;编译器还用于:根据所述待处理张量数据中可折叠维度的数量,确定可折叠维度索引。
在一个实施例中,编译器还用于:当所述待处理张量数据的维度值为常量,则根据所述待处理张量数据的维度折叠信息确定折叠后的张量维度,以实现对所述待处理张量数据的降维处理;其中,所述折叠后的张量维度等于所述待处理张量数据的可折叠维度的维度值的乘积。
在一个实施例中，编译器还用于：若所述待处理张量数据的维度值为常量，则将相应维度值替换为所述常量。
在一个实施例中,所述待处理张量数据指向操作中的操作数;编译器还用于:根据所述操作中至少一个操作数对应的维度折叠信息,确定目标维度折叠信息;所述目标维度折叠信息用于指示所述至少一个操作数在所述操作中的维度折叠方式。
在一个实施例中，所述目标维度折叠信息包括目标可折叠维度和目标维度折叠索引；编译器还用于：根据所述至少一个操作数的可折叠维度，确定目标可折叠维度，其中所述目标可折叠维度为所述至少一个操作数的可折叠维度的交集；基于所述目标可折叠维度和所述至少一个操作数的维度折叠索引，确定所述目标维度折叠索引。
在一个实施例中,编译器还用于:根据所述目标维度折叠信息,对所述操作中的所有操作数进行维度折叠。
在一个实施例中，编译器还用于：在对所述操作中的操作数进行维度折叠之后，编译所述操作，生成所述操作对应的指令，以使目标处理器根据所述指令实现所述操作的功能，获得结果数据。
在一个实施例中,编译器还用于:根据所述目标折叠信息,对所述结果数据进行维度展开,使得所述结果数据的表达形式与所述操作数的初始表达形式一致。
当上述待处理张量数据为变量数据时,本公开还提供一种用于处理该待处理张量数据(即多维张量数据)的方法及其相关产品,下面结合附图来详细描述本公开的具体实施方式。
图12是示出根据本公开实施例的用于处理多维张量数据的方法700的流程图。结合前文的描述，本领域技术人员可以理解本公开的方法700可以适用于利用多维张量数据进行运算的场景。特别地，方法700可以应用于人工智能领域（例如深度机器学习）中对于多维张量数据的处理。在实施方面，此处的方法700可以由处理器来执行，例如在针对于神经网络模型的程序代码的编译阶段来执行，即采用图11中的编译器603来执行。根据本公开的一个实施方式，上文中所述的根据所述待处理张量数据的相邻维度相关参数的语法树来确定所述待处理张量数据的相邻维度是否连续可以包括如图12所示的方法。
如图12所示,在步骤S702处,获取与多维张量数据中的第一维度关联的第一表达式。如前所述,这里的多维张量数据可以是二维或二维以上的张量数据,并且维度的排列可以具有一定的顺序。在一个实施例中,本公开此处的多维张量数据可以是三维或四维张量数据。例如,在四维张量数据的情形中,该张量数据可以具有NHWC的数据格式,其中N表示批处理数(Batch size),H表示张量数据的高度,W表示张量数据的宽度,而C表示张量数据的通道数,如图3中所示出的数据立方体。这里,N维度可以认为是最高维度,而C维度可以认为是最低维度,而H维度和W维度是处于最高维度和最低维度之间的中间维度,并且H维度要高于W维度。
接着,在步骤S704处,获取与多维张量数据中的第二维度关联的第二表达式。在一个实施方式中,这里的第二维度和上述的第一维度是相邻维度。仍以如图3中NHWC数据摆放格式的四维张量数据为例,其中N和H互为相邻维度,其中N维度为第一维度而H维度为第二维度。类似地,H维度和W维度互为相邻维度,其中H维度为第一维度而W维度为第二维度。同样地,W维度和C维度为相邻维度,其中W维度为第一维度而C维度为第二维度。就与各个维度关联的表达式而言,在一个实 施方式中,上述的第一表达式可以是基于多维张量数据的第一维度的跨度(“stride”)所构建的表达式。对应地,上述的第二表达式可以是基于多维张量数据的第二维度的跨度(“stride”)和尺寸(“size”)所构建的表达式。
在获取第一表达式和第二表达式后，在步骤S706处，判断第一表达式和第二表达式是否相等，以便确定第一维度和第二维度是否能够进行维度折叠。举例而言，假设第一维度为W维度而第二维度为C维度，并且第一表达式基于W维度的跨度而第二表达式基于C维度的跨度和尺寸，则此处的相等可以表达为等式：W_stride=C_stride×C_size，其中W_stride表示W维度的跨度而C_stride和C_size分别表示C维度的跨度和尺寸。
以图3中进一步标示出C0~C3、W0~W3和H0~H3的张量数据（其示出为一个大的立方体，其中所包含的每个小立方体代表具有最小粒度的数据块）来说，当试图将中间维度的W维度和最低维度的C维度进行维度折叠时，可以通过等式W_stride=C_stride×C_size来确定，如上文所描述的。由于图中张量数据的W_stride=4，C_stride=1并且C_size=4，即满足前述等式，因此可以判断C维度的数据在W维度上是连续的，因此可以执行折叠操作。通过对张量数据执行W维度和C维度的折叠操作，可以令张量数据具有图3的右部301所示出的数据形状（也即“降维”后的形状），从而方便后续的数据读取，显著减小代码实现过程中的嵌套循环操作（下文将详述）。具体地，第1数据块可以对应于坐标H0W0C0，第2数据块可以对应于坐标H0W0C1，以此类推，第4数据块可对应于坐标H0W0C3，即第1~第4数据块对应于张量数据中左上角沿C维度排列的四个数据块。接着，第5数据块对应于坐标H0W1C0，第6数据块对应于坐标H0W1C1，第7数据块对应于坐标H0W1C2，第8数据块对应于坐标H0W1C3。以此类推，直至得到如图中所示出的沿H0、H1、H2和H3的H维度所排列的四行数据。可以看出，此时W和C维度已经折叠成同一维数据，以便于数据的拷贝操作和运算操作。
关于上述的拷贝操作,以图3所示出的一个数据立方体(即“HWC”三维数据)为例来说。当执行数据拷贝时,从代码实现的角度来说,通常做法是执行多重嵌套循环来读取完所有的数据,其中H维度的大小确定外层的循环次数,而W维度的大小确定内层的循环次数,即构成两层嵌套循环,而C维度的大小确定每次循环所读取的数据块数目。鉴于此,图3所示的一个数据立方体通常需要执行4×4=16次的读取操作,每次读取C维度上的4个数据块。
然而,通过本公开的折叠操作,数据将以降维后的形式(或者说形状)被读写和运算,如301处示出的方式。由此,在执行数据拷贝和运算时的读取过程中,将仅执行一层循环,即仅执行最高维度(即H维度)的4次循环,而每次循环将读取折叠后的“WC”维度上16个数据块。显然,由于将两层嵌套循环转变成一次循环,本公开维度折叠后的数据读取和运算的粒度增大(从4变为16),从而令数据读取和运算更为高效,由此显著提升数据的访存性能和运算性能。进一步,由于减小了嵌套循环,因此也缩减了代码中的控制流开销。
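下面用一个简单的Python片段示意折叠前后读取次数与读取粒度的变化（仅为示例性草图，并非本公开的实际实现，变量名均为假设）：
H, W, C = 4, 4, 4

# 折叠前：两层嵌套循环，共 H*W=16 次读取，每次读取 C=4 个数据块
reads_before = [(h, w, C) for h in range(H) for w in range(W)]
assert len(reads_before) == 16

# W、C 折叠后：仅一层循环，共 H=4 次读取，每次读取 W*C=16 个数据块
reads_after = [(h, W * C) for h in range(H)]
assert len(reads_after) == 4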
进一步地，本公开的方案可以根据维度折叠后的张量数据生成针对所述张量数据的操作指令，该操作指令包括但不限于访存指令和运算指令，该操作指令可以是上述芯片或板卡能够运行的指令。继续回到前文，本公开维度折叠后的数据读取和运算的粒度增大（从4变为16），芯片或板卡执行基于该维度折叠优化后的代码所生成的输入/输出（“I/O”）访存指令，可以显著降低I/O访存的次数并且由此大幅降低在I/O方面的开销。同时，芯片或板卡执行基于该维度折叠优化后的代码所生成的运算指令，可以显著地提高硬件的运算效率。
可以理解的是,经过上述针对W维度和C维度的折叠操作,可以将图中的张量立方体“降维”成一个如301处所示出的二维矩阵,以加速数据的读取并减小如上所述的嵌套循环操作,其中该矩阵的高度仍以H维度构成,而宽度则由C维度和W维度折叠后的新维度构成。根据本公开的折叠方案,仍可以判断是否对H维度和该新维度进行折叠操作。具体地,由于H维度的跨度为16,而所述新维度的跨度为1且尺寸为16,因此仍满足本公开所设定的等式要求,即H维度的数据在折叠后的“WC”维度上是连续的。鉴于此,接着对前述得到的二维矩阵进行维度折叠,从而可以得到如302处示出的一组数组形式(或形状)的张量数据,也即沿H维度排列的从数据1~数据64的数组。通过对H维度和“WC”维度的维度折叠使数据立方体以一维形式呈现,可以在数据拷贝时实现将图3中的一个数据立方体整体进行读取,即不需要嵌套循环操作。由此,就编译阶段而言,可以进一步优化代码,减小控制流方面的开销。另外,由于多维张量数据经折叠后降低了维度,也便于张量数据之间的各类运算,例如对于两个张量数据之间的加法运算,降维后的加法运算将更为简便和快速。
上面结合图3对维度折叠操作及其效果进行了描述。应当清楚的是,上文只以折叠张量数据的其中两个维度进行了说明,但本公开方案的张量维度折叠并不限于两个维度的折叠方案。在其他实施例中,若张量数据在两个以上的维度均连续,则可以在更高维度上进行折叠操作。例如,在确定图3中的W维度和C维度连续之后,可以继续判断该折叠后的(WC)维度与H维度是否连续,如果连续,则可将三维的张量数据折叠为一维数据,并基于该折叠操作生成芯片或板卡能够执行的操作指令,从而提高芯片或板卡对该张量数据的访存和运算效率。其中,折叠后的(WC)维度与H维度的是否连续的判断方法与上述W维度和C维度的判断方法一致,此处不再赘述。
回到图12,当应用于人工智能领域的深度机器学习场景时,图12所示方法可以应用在对神经网络模型进行编译操作所生成的中间表达上,其中编译操作可以通过编译器来执行。关于编译阶段的中间表达,如本领域技术人员所知,编译器通常可以分为前端和后端。在编译过程中,前端会对所输入的程序代码进行例如包括词法分析、语法分析或语义分析等各类分析,并且接着生成类似于数据结构的中间表达IR(Intermediate Representation)。
此后,由编译器的后端对“IR”进行优化,然后生成目标代码。在深度机器学习的场景中,这里的编译器可以是神经网络编译器,如“TVM”等。该神经网络编译器可以从神经网络编程框架(如Tensorflow、pytorch或caffe等)接收神经网络模型。接着,由编译前端首先对神经网络进行解析和重构,从而获得计算图。然后,可以利用编译器中的优化器对生成的计算图进行融合(例如多个算子的融合)或剪枝(例如去除计算图中与最终输出节点无关的边和算子)等优化操作,从而获得图形式的中间表达,也即前述的“IR”。接着,可以将该图形式的中间表达转换为与硬件相适配的中间表达。最后,可以根据该硬件相适配的中间表达来生成该硬件上能够执行的代码,也即可以由例如结合图11描述的AI处理器所执行的二进制指令序列,图11将在后文中进行更详细描述。
通过利用上述神经网络模型的中间表达,方法700还可以获取根据第一表达式所构建的第一抽象语法树和根据第二表达式所构建的第二抽象语法树。如本领域技术人员所知,抽象语法树包括多个节点,并且大致可以分为根节点,其位于抽象语法树的最上层;叶子节点,其位于抽象语法树的最下层;以及子节点,其位于根节点和叶子节点之间的中间层。基于此,构建抽象语法树的过程即是先构建抽象语法树的根节点,然后向下依次构建各个子节点(如果需要的话),直到叶子节点。
通过利用中间表达来构建抽象语法树,本公开的方案可以将判断第一表达式与第二表达式是否相等转换为判断第一抽象语法树和第二抽象语法树是否相等(或者说等价)。由此,当第一抽象语法树和第二抽象语法树相等时,则可以确定第一表达式与第二表达式相等,由此可以确定前述的第一维度和第二维度能够进行维度折叠。相较而言,当第一抽象语法树和第二抽象语法树不相等时,则可以确定第一表达式与第二表达式不相等,由此可以确定第一维度和第二维度不能进行维度折叠。如前所述,当在编译阶段执行本申请的维度折叠操作时,可以减小多重嵌套循环操作,从而减小控制流的开销。进一步,由于进行了维度折叠的降维操作,针对于该降维后的张量数据所生成的I/O指令在执行数据访存时将明显提升访存效率,从而降低I/O方面的开销。
当第一表达式涉及第一维度的跨度时,在利用神经网络模型的中间表达来获取第一抽象语法树的具体实现中,本公开提出沿中间表达向前追溯获取所述第一维度的跨度的计算过程(其也是一个表达式)。在向前追溯期间,可以基于计算过程来构建第一抽象语法树的根节点以及前述根节点下的叶子节点或者根节点下的子节点和叶子节点。类似地,当第二表达式涉及第二维度的跨度和尺寸时,获取第二抽象语法树可以包括沿所述神经网络模型的中间表达向前追溯获取第二维度的跨度和尺寸的计算过程。同样地,在向前追溯期间,可以基于计算过程来构建第二抽象语法树的根节点以及根节点下的叶子节点或者根节点下的子节点和叶子节点。如本领域技术人员所知,抽象语法树中的每个节点可以用于表示程序代码中的常量或变量。前述变量可以是静态单赋值(“Static Single Assignment”,SSA)的。在此基础上,具体到本公开的方案,根节点和子节点可以表示各自关联的SSA值和获得该SSA值所需要执行的运算,则叶子节点表示追溯到停止时的SSA值。对于具有子节点的节点来说,子节点进一步可以表示执行运算所需要的操作数。本公开的方案中,语法树中常量或变量可以具有一个预先设定的系数(“coefficient”),该系数用于表示当前节点参与其父节点计算时所需乘以的系数。
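下面给出一个示意性的Python片段（并非本公开的实际实现；AstNode、build_from_sub等名称均为假设），用于说明节点记录系数、并借助系数-1将减法表示为加法的思路：
class AstNode:
    def __init__(self, op=None, ssa_value=None, coefficient=1, children=None):
        self.op = op                   # '+'、'*'等运算类型，叶子节点为None
        self.ssa_value = ssa_value     # 节点关联的SSA值（或停止追溯时的变量/入口参数）
        self.coefficient = coefficient # 当前节点参与父节点计算时所需乘以的系数
        self.children = children or []

def build_from_sub(minuend, subtrahend):
    """将减法 a - b 表示为加法 a + (-1)*b：被减项的系数取-1。"""
    return AstNode(op='+', children=[
        AstNode(ssa_value=minuend, coefficient=1),
        AstNode(ssa_value=subtrahend, coefficient=-1),
    ])

root = build_from_sub('%a', '%b')      # 对应表达式 a - b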
在一些实施场景中,为了加快追溯过程和建立抽象语法树,本公开提出在一些情形发生时则停止追溯过程。例如,当沿中间表达追溯至内核函数(“kernel”)的入口参数时或者沿中间表达追溯至特定的运算时。就内核函数而言,在不同的场景中,其可以表示神经网络模型的计算过程或算子(例如卷积算子)的计算过程,而入口参数可以是计算过程所需输入的参数。就特定的运算而言,假设本公开的方案仅支持计算过程中所涉及的例如加、减和乘运算时,则当追溯到例如针对维度或步长的具体运算,则此时可以停止追溯,并且将停止追溯时的变量用叶子节点来表示,也即得到抽象语法树的最低层节点。
当停止追溯时,抽象语法树的构建也即完成。此后,可以将得到的第一抽象语法树和第二抽象语法树进行比较,以判断二者是否相同。如前所述,第一抽象语法树或 第二抽象语法树可以是由节点构成的分层结构,因此判断第一抽象语法树和第二抽象语法树是否相同可以是逐层地判断第一抽象语法树和第二抽象语法树是否相同。附加地或替代地,还可以通过判断第一抽象语法树和第二抽象语法树的前预定数目的层是否相同来确定二者是否相同。例如,当前述的两棵抽象语法树由于追溯的特别深从而导致抽象语法树的层数较多时,此时对这样的两棵抽象语法树进行判断将造成明显的时间成本。为此,除了对两棵抽象语法树进行预处理(例如转换,稍后将结合附图描述)以外,本公开提出只对两棵抽象语法树的前若干个层(例如前两层或前三层)进行比对和判断,从而简化两棵抽象语法树的比对操作。
就具体的判断操作而言,在一个实施例中,判断第一抽象语法树和第二抽象语法树是否相同可以包括判断第一抽象语法树的第一根节点和所述第二抽象语法树的第二根节点是否具有相同的运算类型。当二者具有相同的运算类型时,则还需要判断第一根节点下的所有子节点和叶子节点与第二根节点下的所有子节点和叶子节点是否彼此相同或者等价。如上文所述,抽象语法树可以呈分层的结构,因此在比较第一抽象语法树和第二抽象语法树时,可以逐层地判断各个子节点和位于最低层的叶子节点是否相同或者等价。当第一抽象语法树和第二抽象语法树的所有节点都相同或者等价时,则可以认定第一抽象语法树和第二抽象语法树是相同的抽象语法树,并且由此确定第一维度和第二维度可以进行维度折叠,即将第一维度和第二维度上的数据摆放在同一个维度上。对于层数不同的两个抽象语法树或者经本公开下文的转换操作处理后层数仍不同的两个抽象语法树,则可以认定第一表达式和第二表达式是不同的,从而由此确定第一维度和第二维度并不能进行维度折叠。进一步,由于无法维度折叠,张量数据也将无法以“降维”后的形式来表达、读写和运算。
以上结合图12对本公开的维度折叠方案进行了详细的描述。可以理解的是,当通过判断第一抽象语法树和第二抽象语法树是否相同来确定是否执行第一维度和第二维度的折叠时,对于抽象语法树的进一步处理就显得尤为重要。为此,本公开提出对构建得到的第一抽象语法树和第二抽象语法树进行各种类型的转换,以便对基于转换后的、具有最简形式的抽象语法树来进行判断。在本公开的上下文中,最简形式的抽象语法树通常最多具有从上到下排列的三层,即树根层-中间层(包括表示加法或乘法运算的子节点)-叶子节点层。在一些场景中,转换后的抽象语法树仅有两层,即没有处于中间层的子节点。可以理解的是,本公开的如下转换操作仅仅是示例性的和可选择性的,本公开的折叠操作并不受其限制。例如,对于构建的、初始即具有前述最简形式的抽象语法树,则可以不进行如下的转换操作。又例如,对于经构建后无法比较的两个抽象语法树(如层数不同且无法进行如下转换操作进行化简的),则本公开可以直接认定第一表达式和第二表达式并不相同。由此,可以直接判断第一维度和第二维度不具有可折叠性,也即无法折叠和如前所述的连续读取,即仍需循环嵌套读取。下面将结合图13-图20来描述本公开对获取的抽象语法树进行转换操作的不同实施例。
图13是示出在转换抽象语法树的过程中，将减法运算转换为加法运算，并且将乘法运算下移而将加法运算上移的场景。为了实现将减法运算转换为加法运算，可以令节点的系数为-1（默认为1），这里系数可以表示当前节点参与父节点计算时所要乘的值。在该情形下，抽象语法树中将仅存在加法和乘法运算，从而有利于后续的判断操作。具体来说，在仅存在加法和乘法运算的两个表达式中，如果两个表达式是相等的，则将其转换成加法形式后（例如通过按乘法分配律展开），两个表达式的各个项将能对应上，由此方便对两个表达式是否相同进行判断。
以图13中左部所示出的抽象语法树为例，其表示“(a+b)×(c+d)”的表达式。通过按乘法分配律进行展开，可以得到“ac+ad+bc+bd”的表达式，而与之对应的抽象语法树可以如图13右部所示。可以看出，通过转换操作，可以将表达乘法运算（如图13左部的根节点所示）的抽象语法树转换成表达加法运算（如图13右部的根节点所示）的抽象语法树。概括来说，图13所示转换操作即是将乘法运算对应的根节点或子节点向抽象语法树（如本公开的第一抽象语法树和/或第二抽象语法树）的下层方向移动，并且将加法运算对应的子节点和/或叶子节点向抽象语法树的上层方向移动。
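下面的Python片段给出该转换的一个示意性草图（并非本公开的实际实现，distribute为假设的函数名）：
def distribute(node):
    # 将乘法节点向下层移动、加法节点向上层移动，即按分配律展开
    if not isinstance(node, tuple):
        return node
    op, a, b = node[0], distribute(node[1]), distribute(node[2])
    if op == '*':
        if isinstance(a, tuple) and a[0] == '+':      # (x+y)*b -> x*b + y*b
            return distribute(('+', ('*', a[1], b), ('*', a[2], b)))
        if isinstance(b, tuple) and b[0] == '+':      # a*(x+y) -> a*x + a*y
            return distribute(('+', ('*', a, b[1]), ('*', a, b[2])))
    return (op, a, b)

expr = ('*', ('+', 'a', 'b'), ('+', 'c', 'd'))        # (a+b)*(c+d)
print(distribute(expr))                               # 展开后仅由加法连接乘积项，即 a*c + a*d + b*c + b*d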
图14和图15是示出在转换抽象语法树的过程中,将具有相同运算类型的父节点和子节点进行合并的场景。
如图14左部所示,该抽象语法树具有表示加法运算的根节点以及两个执行加法运算的子节点。鉴于运算类型相同,因此可以将该根节点和两个子节点进行合并,从而得到图14右部所示出的抽象语法树。从抽象语法树所表示的表达式来看,图中所示例子的转换操作也即是将表达式“(a+b)+(c+d)”转换成表达式“a+b+c+d”。
如图15左部所示,该抽象语法树具有表示加法运算的根节点和分别表示加法运算和乘法运算的两个子节点。与图14类似,鉴于根节点和左子节点表示相同的加法运算,因此可以将该左子节点和根节点进行合并,从而将叶子节点所表示的变量a和b向上层移动;由此,得到如图15的右部所示出的抽象语法树。从抽象语法树所表示的表达式来看,图中所示例子的转换操作也即是将表达式“(a+b)+(c×d)”转换成表达式“a+b+(c×d)”。
通过上述的示例性合并操作,转换后的抽象语法树通常最多具有从上到下排列的三层,即树根层-中间层(包括表示加法或乘法运算的子节点)-叶子节点层。在一些场景中,转换后的抽象语法树仅有两层,即没有处于中间层的子节点。在该场景中,抽象语法树中的根节点直接和叶子节点相连。概括来说,图14和图15所示转换操作即是确定抽象语法树中是否存在运算类型相同的根节点和子节点,以及响应于存在运算类型相同的根节点和子节点,将根节点和子节点进行合并。
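下面用一个示意性的Python片段说明这种合并（扁平化）操作（并非本公开的实际实现，flatten为假设的函数名）：
def flatten(node):
    # 当父节点与子节点的运算类型相同时将二者合并，例如 (a+b)+(c+d) -> a+b+c+d
    if not isinstance(node, tuple):
        return node
    op, *args = node
    merged = []
    for child in (flatten(a) for a in args):
        if isinstance(child, tuple) and child[0] == op:
            merged.extend(child[1:])        # 子节点与父节点运算类型相同，直接并入上层
        else:
            merged.append(child)
    return (op, *merged)

assert flatten(('+', ('+', 'a', 'b'), ('+', 'c', 'd'))) == ('+', 'a', 'b', 'c', 'd')
assert flatten(('+', ('+', 'a', 'b'), ('*', 'c', 'd'))) == ('+', 'a', 'b', ('*', 'c', 'd'))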
在一个场景中,本公开还提出对抽象语法树执行化简操作,具体为同类项的化简操作。这里,同类项可以是抽象语法树中表示立即数、立即数乘以变量和/或变量形式的节点。例如,图16示出了对表示立即数的节点的化简操作,也即将图16左部抽象语法树中表示7、4和2的叶子节点化简变成图16右部抽象语法树中表示立即数13的节点,从而将抽象语法树进行化简。类似地,图17示出了对表示立即数乘以变量的节点的化简操作。具体地,通过将图17左部抽象语法树中的表示“a×3”的节点与表示变量“a”的节点进行化简,可以得到图17右部包括表示“4a”的节点与根节点连接的抽象语法树。进一步,图18示出了对表示变量的节点进行化简操作,即将图18左部抽象语法树中的表示a、b和-a的节点进行合并,从而得到如图18右部所示出的抽象语法树,其中表示“b”的叶子节点直接与根节点相连。
在一些场景中，如果子节点的数目相对较多，则为了找到上述的同类项进行合并，可以针对抽象语法树中的节点来生成基于特征的哈希值，并且根据生成的哈希值来进行排序。当出现相同的哈希值时，则可以认为节点具有相同的同类项，从而可以通过合并来简化抽象语法树。以图17为例，对图17中的节点执行基于特征的哈希运算，由于不考虑系数，因此“3a”和“a”这两个变量具有相同的哈希值。由此判断两个节点是同类项，可以通过合并来化简，从而得到表示“4a”的节点。
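下面的Python片段示意基于特征哈希合并同类项的思路（并非本公开的实际实现；这里直接用Python内置hash近似“基于特征的哈希值”，且仅考虑“系数×变量”形式的项，函数名为假设）：
def combine_like_terms(terms):
    """terms: [(系数, 变量名或None表示纯立即数), ...]"""
    combined = {}
    for coeff, var in terms:
        key = hash(var)                          # 基于特征（变量本身）的哈希，不考虑系数
        c, _ = combined.get(key, (0, var))
        combined[key] = (c + coeff, var)
    return sorted(combined.values(), key=lambda t: str(t[1]))

# 7 + 4 + 2 -> 13（立即数合并）；3a + a -> 4a（同类项合并）
assert combine_like_terms([(7, None), (4, None), (2, None)]) == [(13, None)]
assert combine_like_terms([(3, 'a'), (1, 'a')]) == [(4, 'a')]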
通过上述的化简操作，可以将所有表示立即数的节点化简为最简单的形式，并且还可以将所有的同类项进行合并，从而可以得到最多只有三层的抽象语法树。在一些场景中，对抽象语法树的转换还可以包括将其中的分支进行消除，下面将结合图19和图20来对此情形进行描述。
图19左部的抽象语法树示出了表示“x”和“y”变量的节点通过根节点的加法运算相连。当“y”节点的系数为0时，则表示该分支经过化简后完全被消除了，因此可以得到如图19右部所示出的抽象语法树。图20左部的抽象语法树示出了表示“c”变量的叶子节点通过乘法运算的子节点与根节点相连。由于对于表示乘法运算的子节点来说，表示“c”变量的叶子节点是其唯一的子节点，因此可以将该分支消除，从而得到如图20右部所示出的抽象语法树。
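下面给出分支消除的一个示意性Python草图（并非本公开的实际实现；节点用（系数, 运算或变量, 子节点列表）三元组近似表示，prune为假设的函数名）：
def prune(coeff, op, children):
    kept = []
    for c in children:
        c = prune(*c)
        if c is not None and c[0] != 0:          # 消除系数为0的分支
            kept.append(c)
    if not children:                             # 叶子节点
        return (coeff, op, []) if coeff != 0 else None
    if len(kept) == 1:                           # 仅剩一个子节点的单分支，直接用其唯一子节点替代
        child = kept[0]
        return (coeff * child[0], child[1], child[2])
    return (coeff, op, kept)

# x + 0*y 化简为 x
tree = (1, '+', [(1, 'x', []), (0, 'y', [])])
assert prune(*tree) == (1, 'x', [])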
以上结合图13-图20对本公开就抽象语法树执行的转换操作进行了描述。在一些场景中，可以基于转换后的第一抽象语法树和第二抽象语法树来进行比对判断，以便通过二者的相同与否来确定第一维度和第二维度是否可以进行折叠。下面将结合图4对本公开关于如何判断两个抽象语法树为相同的抽象语法树进行描述。
假定图4上部是本公开上下文中的第一抽象语法树，其可以是沿神经网络模型的中间表达向前追溯获取第一维度（例如前述的W维度）的跨度的计算过程期间所构建的。相对应地，图4下部是本公开上下文中的第二抽象语法树，其可以是沿神经网络模型的中间表达向前追溯获取第二维度（例如前述的C维度）的跨度和尺寸的计算过程期间所构建的。通过判断图4中的第一抽象语法树和第二抽象语法树相同，即可以推断出前述的等式W_stride=C_stride×C_size成立，从而判断出第一维度和第二维度是连续的，二者可以执行维度折叠操作以转换到一个维度上。
从图4中所示可以看出，第一抽象语法树可以代表“(a×b)+(c×d)+(e×f×g)+(-h)”的表达式（其例如等价于W_stride），而第二抽象语法树可以代表“(-h)+(b×a)+(c×d)+(e×f×g)”（其例如等价于“C_stride×C_size”）的表达式。由于第一抽象语法树和第二抽象语法树在第二层和第三层上还存在差异，在一个实施场景中，如前所述，本公开提出利用基于指纹的哈希运算来确定第一根节点下的所有子节点和第二根节点下的所有子节点是否彼此等价。具体来说，可以对第一根节点和第二根节点下的所有子节点所表示的变量分别执行基于指纹的哈希运算。接着，可以对哈希运算的运算结果进行排序。最后，可以将排序后第一根节点和第二根节点下具有相同运算结果的子节点判断为彼此等价。对于基于指纹的哈希运算来说，抽象语法树中的某个节点的哈希值就等于与其相连的所有子节点的哈希值与各自系数乘积后的和。对于是常量的子节点，其哈希值就是其本身，而对于叶子节点来说，其哈希值是其保存的操作数的指针值。就如图4所示例子来说，当第一抽象语法树和第二抽象语法树中连接根节点并代表“(a×b)”、“(c×d)”、“(e×f×g)”和“-h”的四个子树的哈希值都相同时，则可以判断两棵抽象语法树相同。由此，可以确定与两棵抽象语法树关联的第一维度和第二维度是可以折叠的。
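下面的Python片段给出基于指纹的哈希比较的一个示意性草图（并非本公开的实际实现；这里用内置hash近似叶子节点的“指针值”，节点哈希按“子节点哈希×系数求和”的方式计算，函数名均为假设）：
def fingerprint(node):
    coeff, value, children = node
    if not children:                              # 常量节点的哈希为其本身，叶子节点的哈希为其标识
        return value if isinstance(value, int) else hash(value)
    return sum(c[0] * fingerprint(c) for c in children)

def roots_equivalent(root1, root2):
    # 对两个根节点下各子树的指纹哈希排序后比较
    h1 = sorted(fingerprint(c) for c in root1[2])
    h2 = sorted(fingerprint(c) for c in root2[2])
    return h1 == h2

# 根节点下代表 (a×b) 与 (b×a) 的两个子树具有相同的指纹哈希
t1 = (1, '+', [(1, '*', [(1, 'a', []), (1, 'b', [])])])
t2 = (1, '+', [(1, '*', [(1, 'b', []), (1, 'a', [])])])
assert roots_equivalent(t1, t2)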
上述实施例描述了单个张量数据的维度折叠过程，在本公开的一种实施方式中，该张量数据可以是某一操作涉及的数据。在本公开折叠方案的一个应用场景中，当参与操作（“op”）的数据涉及两个以上的多维张量数据，且两个以上的多维张量数据各自对应的维度均可折叠时，则为了操作的正确执行，两个以上多维张量数据的折叠方式应当相同。其中，该操作包括但不限于运算操作（如加减乘除等四则运算操作、卷积操作、激活操作、全连接操作等等）和访存操作（如拷贝操作）。同一操作中每个多维张量数据是否能够进行维度折叠可以参照上述实施例。
例如,当操作是加法运算(即add算子),且加法运算具有两个操作数,两个操作数分别为A和B两个张量数据,A和B两个张量数据都具有HWC的数据格式,则当将A张量数据折叠为H*(WC)的形状时,则B张量数据也必须折叠成H*(WC)的形状,以便执行加法运算。否则,当仅对A张量数据进行维度折叠而B张量数据未进行维度折叠,则由于加法运算的输入数据发生变化(此例中的A张量数据的形状变化造成的数据读取变化)而造成计算无法进行或发生错误。之后,在对上述两个操作数进行相同的维度折叠之后,可以生成该操作对应的操作指令。该操作指令可以是人工智能处理器能够直接执行的操作指令,在人工智能处理器执行该操作指令时,可以以折叠后的张量数据为粒度对张量数据进行读取和运算,从而提高了该张量数据的处理效率。
再例如,当操作是拷贝操作(即copy算子)时,为了实现正确的拷贝操作,源数据和拷贝操作得到的目标数据也需要具有相同的折叠方式。假设该拷贝操作用于将A张量数据从片外存储空间DDR拷贝至片上存储空间。其中,A张量数据是存储在DDR上的源数据且具有4*3*2(对应于HWC维度)的数据形状,当将其拷贝至片上存储空间时,则执行copy(4*3*2->4*3*2),其中“->”用于指示数据拷贝的方向,“->”左侧的“4*3*2”代表位于DDR上的源数据,“->”右侧的“4*3*2”代表位于片上的目标数据。当源数据“4*3*2”折叠为“4*6”的数据形状时,则目标数据也同样需要折叠为“4*6”的数据形状,由此使得DDR上的源数据和片上的目标数据解析数据的方式一致,从而实现数据拷贝操作。进一步可以生成该拷贝操作对应的访存指令,在人工智能处理器执行该操作指令时,可以以折叠后的张量数据为粒度对张量数据进行读取,从而提高了该张量数据的访存效率。
图6示出根据本公开实施例的一种板卡10的结构示意图。可以理解的是图6所示结构和组成仅仅是一种示例,其并不用于在任何方面对本公开的方案进行限制。
如图6所示,板卡10包括芯片101,其可以是一种系统级芯片(System on Chip,SoC),也即本公开上下文中所描述的片上系统。在一个实施场景中,其可以集成有一个或多个组合处理装置。前述组合处理装置可以是一种人工智能运算单元,用以支持各类深度学习和机器学习算法,满足计算机视觉、语音、自然语言处理、数据挖掘等领域复杂场景下的智能处理需求,特别是深度学习技术大量应用在云端智能领域。云端智能应用的一个显著特点是输入数据量大,对平台的存储能力和计算能力有很高的要求,而本实施例的板卡10适用在云端智能应用,具有庞大的片外存储、片上存储和强大的计算能力。
进一步如图中所示，芯片101通过对外接口装置102与外部设备103相连接。根据不同的应用场景，外部设备103例如可以是服务器、计算机、摄像头、显示器、鼠标、键盘、网卡或WI-FI接口等。待处理的数据可以由外部设备103通过对外接口装置102传递至芯片101。芯片101的计算结果可以经由对外接口装置102传送回外部设备103。根据不同的应用场景，对外接口装置102可以具有不同的接口形式，例如PCIe接口等。
板卡10还可以包括用于存储数据的存储器件104,其包括一个或多个存储单元105。存储器件104通过总线与控制器件106和芯片101进行连接和数据传输。板卡10中的控制器件106可以配置用于对芯片101的状态进行调控。为此,在一个应用场景中,控制器件106可以包括单片机(Micro Controller Unit,MCU)。
图7是示出根据上述实施例的芯片101中的组合处理装置的结构图。如图7中所示,组合处理装置20可以包括计算装置201、接口装置202、处理装置203和动态随机存取存储器(Dynamic Random Access Memory,DRAM)DRAM 204。
计算装置201可以配置成执行用户指定的操作,主要实现为单核智能处理器或者多核智能处理器。在一些操作中,其可以用于执行深度学习或机器学习方面的计算,并且还可以通过接口装置202与处理装置203进行交互,以共同完成用户指定的操作。在一个实施场景中,此处的计算装置可以配置成执行本公开上下文中的卷积操作或矩阵乘操作。
接口装置202可以用于在计算装置201与处理装置203间传输数据和控制指令。例如,计算装置201可以经由接口装置202从处理装置203中获取输入数据(例如本公开上下文中与神经网络运算相关的各种类型数据,包括经折叠后的张量数据),写入计算装置201片上的存储装置。进一步,计算装置201可以经由接口装置202从处理装置203中获取控制指令,写入计算装置201片上的控制缓存中。替代地或可选地,接口装置202也可以读取计算装置201的存储装置中的数据并传输给处理装置203。
处理装置203作为通用的处理装置,执行包括但不限于数据搬运、对计算装置201的开启和/或停止等基本控制。根据实现方式的不同,处理装置203可以是中央处理器(Central Processing Unit,CPU)、图形处理器(Graphics Processing Unit,GPU)或其他通用和/或专用处理器中的一种或多种类型的处理器,这些处理器包括但不限于数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,并且其数目可以根据实际需要来确定。如前所述,仅就本公开的计算装置201而言,其可以视为具有单核结构或者同构多核结构。然而,当将计算装置201和处理装置203整合共同考虑时,二者视为形成异构多核结构。在一些实施场景中,此处的异构多核结构中的处理装置203可以对体现神经网络模型的指令代码进行编译,以形成可以由计算装置201可执行的二进制指令序列。
DRAM 204用以存储待处理的数据,并且在一个实现场景中可以是DDR内存,其大小通常可以为16G或更大,用于保存计算装置201和/或处理装置203的数据。
图8示出了计算装置201为单核的内部结构示意图。单核计算装置301用以处理计算机视觉、语音、自然语言、数据挖掘等输入数据,单核计算装置301包括三大模块:控制模块31、运算模块32及存储模块33。
控制模块31用以协调并控制运算模块32和存储模块33的工作，以完成深度学习的任务，其包括取指单元（Instruction Fetch Unit，IFU）311及指令译码单元（Instruction Decode Unit，IDU）312。取指单元311用以获取来自处理装置203的指令，指令译码单元312则将获取的指令进行译码，并将译码结果作为控制信息发送给运算模块32和存储模块33。当执行本公开的具体方案时，这里的指令可以是用于执行矩阵乘或卷积运算的通用卷积指令。
运算模块32包括向量运算单元321和矩阵运算单元322。向量运算单元321可以用于执行向量运算,并且可支持向量乘、加、非线性变换等相对复杂的运算。相对而言,矩阵运算单元322负责深度学习算法的核心计算,即本公开上下文中所提到的矩阵乘和卷积运算。存储模块33可以用于存储或搬运相关数据,包括神经元存储单元(Neuron RAM,NRAM)331、参数存储单元(Weight RAM,WRAM)332、直接内存访问模块(Direct Memory Access,DMA)333。NRAM 331用以存储输入神经元、输出神经元和计算后的中间结果;WRAM 332则用以存储深度学习网络的卷积核,即权值;DMA 333通过总线34连接DRAM 204,负责单核计算装置301与DRAM 204间的数据搬运。
图9示出了计算装置201为多核的内部结构示意图。多核计算装置41可以采用分层结构设计并且可以作为一个片上系统来运行,其可以包括至少一个集群(cluster),每个集群又包括多个处理器核。换言之,多核计算装置41是以片上系统-集群-处理器核的层次所构成的。以片上系统的层级来看,如图9所示,多核计算装置41包括外部存储控制器401、外设通信模块402、片上互联模块403、同步模块404以及多个集群405。
外部存储控制器401可以有多个(如图中示例性地示出2个),其用以响应处理器核发出的访问请求,访问外部存储设备,也即本公开上下文中的片外存储器(例如图7中的DRAM 204),从而自片外读取数据或是将数据写入。外设通信模块402用以通过接口装置202接收来自处理装置203的控制信号,启动计算装置201执行任务。片上互联模块403将外部存储控制器401、外设通信模块402及多个集群405连接起来,用以在各个模块间传输数据和控制信号。同步模块404是一种全局同步屏障控制器(Global Barrier Controller,GBC),用以协调各集群的工作进度,确保信息的同步。本公开的多个集群405是多核计算装置41的计算核心。尽管在图9中示例性地示出4个集群,然而,随着硬件的发展,本公开的多核计算装置41还可以包括8个、16个、64个、甚至更多的集群405。在一个应用场景中,集群405可以用于高效地执行深度学习算法。
以集群的层级来看,如图9所示,每个集群405可以包括多个处理器核(IPU core)406及一个存储核(MEM core)407。
处理器核406在图中示例性地示出为4个,本公开不限制处理器核406的数量,并且其内部架构如图10所示。每个处理器核406类似于图8的单核计算装置301,并且同样可以包括三个模块:控制模块51、运算模块52和存储模块53。控制模块51、运算模块52及存储模块53的功用及结构大致与控制模块31、运算模块32及存储模块33相同,此处不再赘述。需特别说明的是,存储模块53可以包括输入/输出直接内存访问模块(Input/Output Direct Memory Access,IODMA)533、搬运直接内存访问模块(Move Direct Memory Access,MVDMA)534。IODMA 533通过广播总线409 控制NRAM 531/WRAM 532与DRAM 204的访存;MVDMA 534则用以控制NRAM 531/WRAM 532与存储单元(SRAM)408的访存。
回到图9,存储核407主要用以存储和通信,即存储处理器核406间的共享数据或中间结果、执行集群405与DRAM 204之间的通信、集群405间彼此的通信以及处理器核406间彼此的通信等。在其他实施例中,存储核407可以具有标量运算的能力,用以执行标量运算。
存储核407可以包括静态随机存取存储器(Static Random-Access Memory,SRAM)408、广播总线409、集群直接内存访问模块(Cluster Direct Memory Access,CDMA)410及全局直接内存访问模块(Global Direct Memory Access,GDMA)411。在一个实施场景中,SRAM 408可以承担高性能数据中转站的角色。由此,在同一个集群405内不同处理器核406之间所复用的数据不需要通过处理器核406各自向DRAM 204获得,而是经SRAM 408在处理器核406间中转。进一步,存储核407仅需要将复用的数据从SRAM 408迅速分发给多个处理器核406即可,从而可以提高核间通信效率,并显著减少片上片外的输入/输出访问。
广播总线409、CDMA 410及GDMA 411则分别用来执行处理器核406间的通信、集群405间的通信和集群405与DRAM 204的数据传输。以下将分别说明。
广播总线409用以完成集群405内各处理器核406间的高速通信,此实施例的广播总线409支持核间通信方式包括单播、多播与广播。单播是指点对点(例如单一处理器核至单一处理器核)的数据传输,多播是将一份数据从SRAM 408传输到特定几个处理器核406的通信方式,而广播则是将一份数据从SRAM 408传输到所有处理器核406的通信方式,属于多播的一种特例。
CDMA 410用以控制在同一个计算装置201内不同集群405间的SRAM 408的访存。GDMA 411与外部存储控制器401协同，用以控制集群405的SRAM 408到DRAM 204的访存，或是将数据自DRAM 204读取至SRAM 408中。从前述可知，DRAM 204与NRAM 531或WRAM 532间的通信可以经由2种方式来实现。第一种方式是通过IODMA 533直接在DRAM 204与NRAM 531或WRAM 532间通信；第二种方式是先经由GDMA 411使得数据在DRAM 204与SRAM 408间传输，再经过MVDMA 534使得数据在SRAM 408与NRAM 531或WRAM 532间传输。尽管第二种方式可能需要更多的元件参与且数据流较长，但实际上在部分实施例中，第二种方式的带宽远大于第一种方式，因此通过第二种方式来执行DRAM 204与NRAM 531或WRAM 532间的通信可能更为有效。可以理解的是，这里所描述的数据传输方式仅仅是示例性的，并且本领域技术人员根据本公开的教导，也可以根据硬件的具体布置来灵活地选择和适用各种数据传输方式。
在其他的实施例中,GDMA 411的功能和IODMA 533的功能可以整合在同一部件中。尽管本公开为了方便描述,将GDMA 411和IODMA 533视为不同的部件,然而对于本领域技术人员来说,只要其实现的功能以及达到的技术效果与本公开类似,即属于本公开的保护范围。进一步地,GDMA 411的功能、IODMA 533的功能、CDMA 410的功能、MVDMA 534的功能也可以由同一部件来实现。
图11示出本公开一实施例中数据流编程的软硬件架构的设计图。从图中所示可以看出，此实施例中的软硬件架构可以包括人工智能（“AI”）处理器601、驱动及操作系统602、编译器及编译语言603、库604、框架层605和应用层606。
具体来说,AI处理器601在硬件设计上同时考虑运算优化和数据搬运优化。为此,其采用定制化的运算单元来加速运算,并且使用片上存储来加速数据搬运,从而获得极高的性能和能效比。另外,为了支持各种算法优化,AI处理器601可以具有定制化的运算单元和指令集,其中指令集可以提供不同粒度的运算指令(标量、向量和/或矩阵)。进一步,当考虑算法访存特征、硬件成本、验证难度等多方面的因素,则可以采用片上存储的方式,并且优化数据搬运。在实际操作中,本公开的AI处理器可以实现超出主流GPU(图形处理单元)几十倍以上的速度。
驱动及操作系统602主要负责实现任务在AI处理器601上的调度。该调度操作可以涉及分配、释放设备内存、根据任务优先级进行调度、多设备之间的通信及同步等。对于编译后的程序,其可以通过操作系统和驱动实现待实施的任务在特定处理器上的调度执行,包括但不限于如下的操作:分配、释放设备内存、实现设备之间数据传输、维护任务队列,以及根据优先级调度任务,实现多设备间的同步和协作。
编译器及编译语言603可以是针对AI处理器601的指令集研发的一套汇编语言。在应用中,其可以将面向AI处理器601开发的深度学习算子翻译成处理器指令组合,以便于调用AI处理器601,从而高效地使用该AI处理器601。
库604可以包括运行时库614和机器学习库624。在一个实施场景中,前述库604可以使用AI处理器601的指令集并根据AI处理器601的指令集进行部分优化,以提高算子的运行速度。运行时库614可以是针对AI处理器601专门开发的一套高性能算子库,并且其可以用于完成通用处理器和人工智能处理器之间的交互。进一步,该运行时库614还可以提供一套面向人工智能处理器的接口。对于机器学习库624,其可以用于在人工智能处理器上加速各种机器学习或者深度学习算法。具体地,该机器学习库624可以提供一套高效、通用、灵活且可扩展的编程接口,其上层的机器学习应用可以直接采用各种编程框架(例如TensorFlow、Caffe、MXNet等)的编程接口,也可以使用机器学习库624提供的接口来直接编程。另外,本公开的机器学习库624可以方便硬件平台的调用,而运行时库614可以实现一些基础的常用算子,如卷积、池化等各种操作。
框架层605可以增加对面向AI处理器开发的算子的封装,并且主要是对运行时库614的算子的封装。除此之外,框架层605还可以修改相关的任务调度或内存管理等部分。应用层606可以是深度学习算法开发者提供的应用平台,并且基于原生的框架层605拓展了模型运行时对AI处理器601调用的支持。在实际应用场景中,框架层605可以实现对运行时库614中高性能算子库里算子的封装与支持,并且其主要是利用数据流图根据图优化机制构建起深度学习模型的计算过程。
以上结合图6-图11示例性地对本公开的硬件架构、软件架构以及二者的结合及其内部结构进行了详细的描述。可以理解的是上述描述仅仅是示例性的而非限制性的,并且根据不同的应用场景和硬件规格,本领域技术人员也可以对本公开前述的板卡及其内部结构进行改变,而这些改变依然落入本公开的保护范围内。基于前述的软硬件架构,还结合图12-图20来描述本公开所提出的维度处理方案。通过利用本公开的维度处理方案,可以提升多维张量数据的处理效率,特别是提升本公开前述软硬件架构在执行数据拷贝时的运行效率。
以上结合附图对本公开的方案进行了详细的描述。根据不同的应用场景,本披露的电子设备或集成电路装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本披露的电子设备或集成电路装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本披露的电子设备或集成电路装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本披露方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或集成电路装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本披露将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本披露的方案并不受所描述的动作的顺序限制。因此,依据本披露的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本披露所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本披露某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本披露对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本披露某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本披露的公开和教导,本领域技术人员可以理解本披露所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或集成电路装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行划分,而实际实现时也可以有另外的划分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本披露中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本披露实施例所述方案的目的。另外,在一些场景中,本披露实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在一些实现场景中,上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时,所述集成的单元可以存储在计算机可读取存储器中。基于此,当本披露的方案以软件产品(例如计算机可读存储介质)的形式体现时,该软件产品可以存储在存储器中,其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本披露实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(“Read Only Memory”,简写为ROM)、随机存取存储器(“Random Access Memory”,简写为RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如CPU、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(“Resistive Random Access Memory”,简写为RRAM)、动态随机存取存储器(“Dynamic Random Access Memory”,简写为DRAM)、静态随机存取存储器(“Static Random Access Memory”,简写为SRAM)、增强动态随机存取存储器(“Enhanced Dynamic Random Access Memory”,简写为“EDRAM”)、高带宽存储器(“High Bandwidth Memory”,简写为“HBM”)、混合存储器立方体(“Hybrid Memory Cube”,简写为“HMC”)、ROM和RAM等。
依据以下条款可更好地理解前述内容:
条款A1:一种编译优化方法,包括:
获取待处理张量数据的维度信息,其中,所述维度信息包含所述待处理张量数据的至少一个维度的维度值;
根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息;
根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行维度折叠。
条款A2:根据A1所述的方法,所述维度折叠信息包括可折叠维度;
所述根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息,包括:
根据所述待处理张量数据的维度信息,确定所述待处理张量数据的相邻维度是否连续;
当所述待处理张量数据的相邻维度连续时,则确定所述相邻维度为可折叠维度。
条款A3:根据条款A2或A1所述的方法,所述待处理张量数据的相邻维度包括第一维度和第二维度,所述方法还包括:
当所述第一维度方向上的数据之间的步长等于第二维度方向上数据之间的步长和第二维度方向上数据的尺寸之积时,则确定所述待处理张量数据的相邻维度连续。
条款A4:根据条款A1-A3任一项所述的方法，所述维度折叠信息还包括维度折叠索引；所述根据所述待处理张量数据的维度信息，确定所述待处理张量数据的维度折叠信息，还包括：
根据所述待处理张量数据中可折叠维度的数量,确定可折叠维度索引。
条款A5:根据条款A1-A4任一所述的方法,所述根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行降维处理,包括:
当所述待处理张量数据的维度值为常量,则根据所述待处理张量数据的维度折叠信息确定折叠后的张量维度,以实现对所述待处理张量数据的降维处理;
其中,所述折叠后的张量维度等于所述待处理张量数据的可折叠维度的维度值的乘积。
条款A6:根据条款A1-A5任一项所述的方法,所述方法还包括:
若所述待处理张量数据的维度值为常量,则将相应维度值替换为所述常量。
条款A7:根据条款A1-A6任一所述的方法,所述待处理张量数据指向操作中的操作数;所述方法还包括:
根据所述操作中至少一个操作数对应的维度折叠信息,确定目标维度折叠信息;所述目标维度折叠信息用于指示所述至少一个操作数在所述操作中的维度折叠方式。
条款A8:根据条款A1-A7所述的方法,所述目标维度折叠信息包括目标可折叠维度和目标维度折叠索引;所述根据所述操作中至少一个操作数对应的维度折叠信息,确定目标维度折叠信息,包括:
根据所述至少一个操作数的可折叠维度,确定目标可折叠维度,其中所述目标可折叠维度为所述至少一个操作数的可折叠维度的交集;
基于所述目标可折叠维度和所述至少一个操作数的维度折叠指引,确定所述目标维度折叠索引。
条款A9:根据条款A1-A8任一所述的方法,所述方法还包括:
根据所述目标维度折叠信息,对所述操作中的所有操作数进行维度折叠。
条款A10:根据条款A1-A9任一所述的方法,所述方法还包括:
在对所述操作中的操作数进行维度折叠之后，编译所述操作，生成所述操作对应的指令，以使目标处理器根据所述指令实现所述操作的功能，获得结果数据。
条款A11:根据条款A1-A10任一所述的方法,所述方法还包括:
根据所述目标折叠信息,对所述结果数据进行维度展开,使得所述结果数据的表达形式与所述操作数的初始表达形式一致。
条款A12:一种计算机设备:包括:
处理器;
用于存储处理器计算机可执行程序的存储器;
其中,所述处理器被配置为执行所述计算机可执行程序,以实现条款A1-A11中任意一项所述的方法。
条款A13:一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现条款A1-A11中任意一项所述的方法。
条款B1.一种用于处理多维张量数据的方法,该方法由处理器执行,并且包括:
获取与所述多维张量数据中的第一维度关联的第一表达式;
获取与所述多维张量数据中的第二维度关联的第二表达式,其中所述第一维度和第二维度是多维张量数据的相邻维度;以及
判断所述第一表达式与所述第二表达式是否相等,以便确定所述第一维度和所述第二维度是否能够进行维度折叠。
条款B2.根据条款B1所述的方法,还包括:
获取根据所述第一表达式构建的包含节点的第一抽象语法树;
获取根据所述第二表达式构建的包含节点的第二抽象语法树;以及
通过判断所述第一抽象语法树和第二抽象语法树是否相同来判断所述第一表达式与所述第二表达式是否相等。
条款B3.根据条款B2所述的方法,其中所述第一表达式基于神经网络模型中的多维张量数据的第一维度的跨度,而所述第二表达式基于所述神经网络模型中的多维张量数据的第二维度的跨度和尺寸。
条款B4.根据条款B3所述的方法,其中获取所述第一抽象语法树包括:
沿所述神经网络模型的中间表达向前追溯获取所述第一维度的跨度的计算过程;以及
在所述向前追溯期间,基于所述计算过程构建:
所述第一抽象语法树的根节点;以及
所述根节点下的叶子节点或者所述根节点下的子节点和叶子节点。
条款B5.根据条款B3所述的方法,其中获取所述第二抽象语法树包括:
沿所述神经网络模型的中间表达向前追溯获取所述第二维度的跨度和尺寸的计算过程;以及
在所述向前追溯期间,基于所述计算过程来构建:
所述第二抽象语法树的根节点;以及
所述根节点下的叶子节点或者所述根节点下的子节点和叶子节点。
条款B6.根据条款B4或B5所述的方法,还包括响应于追溯至发生如下情形之一时,停止所述向前追溯:
沿所述中间表达追溯至内核函数的入口参数时;或者
沿所述中间表达追溯至特定的运算时。
条款B7.根据条款B4或B5所述的方法,其中在判断所述第一抽象语法树和第二抽象语法树是否相同前,所述方法还包括:
对所述第一抽象语法树或第二抽象语法树进行转换,以便转换后的第一抽象语法树和第二抽象语法树适于所述判断。
条款B8.根据条款B7所述的方法,其中对所述第一抽象语法树或第二抽象语法树进行转换包括:
将乘法运算对应的根节点或子节点向所述第一抽象语法树或第二抽象语法树的下层方向移动;以及
将加法运算对应的子节点和/或叶子节点向所述第一抽象语法树或第二抽象语法树的上层方向移动。
条款B9.根据条款B7所述的方法,其中对所述第一抽象语法树或第二抽象语法树进行转换包括:
确定所述第一抽象语法树或第二抽象语法树中是否存在运算类型相同的根节点和子节点;以及
响应于存在运算类型相同的根节点和子节点,将所述根节点和子节点进行合并。
条款B10.根据条款B7所述的方法,其中对所述第一抽象语法树或第二抽象语法树进行转换还包括对所述第一抽象语法树或第二抽象语法树执行化简操作。
条款B11.根据条款B10所述的方法,其中所述化简操作包括对第一抽象语法树或第二抽象语法树执行以下之一的化简操作:
对表示立即数的多个节点执行合并;
对表示立即数乘以变量和所述变量的多个节点执行合并;以及
对表示多个相同变量的多个节点进行合并。
条款B12.根据条款B11所述的方法,其中在执行合并中,所述方法包括:
对所述第一抽象语法树或第二抽象语法树中的节点所表示的变量执行基于特征的哈希运算;
对所述哈希运算的运算结果进行排序;以及
对排序后具有相同运算结果的节点进行合并。
条款B13.根据条款B7所述的方法,其中对所述第一抽象语法树或第二抽象语法树进行转换还包括:
消除所述第一抽象语法树或第二抽象语法树中系数为0的分支;和/或
消除所述第一抽象语法树或第二抽象语法树中仅有一个节点的单分支。
条款B14.根据条款B8-B13的任意一项所述的方法,其中所述第一抽象语法树或第二抽象语法树是由节点构成的分层结构,其中判断所述第一抽象语法树和第二抽象语法树是否相同包括针对于转换后的第一抽象语法树和/或第二抽象语法树:
逐层地判断所述第一抽象语法树和第二抽象语法树是否相同;或者
判断所述第一抽象语法树和第二抽象语法树的前预定数目的层是否相同。
条款B15.根据条款B14所述的方法,其中判断所述第一抽象语法树和第二抽象语法树是否相同包括:
判断所述第一抽象语法树的第一根节点和所述第二抽象语法树的第二根节点是否具有相同的运算类型;以及
判断所述第一根节点下的所有子节点和叶子节点与所述第二根节点下的对应所有子节点和叶子节点是否彼此相同或等价。
条款B16.根据条款B15所述的方法,其中判断所述第一根节点下的所有子节点与所述第二根节点下的所有子节点是否彼此等价包括:
对所述第一根节点和第二根节点下的所有子节点所表示的变量分别执行基于指纹的哈希运算;
对所述哈希运算的运算结果进行排序;以及
将排序后第一根节点和第二根节点下具有相同运算结果的子节点判断为彼此等价。
条款B17.根据条款B14所述的方法,还包括:
响应于判断所述第一表达式和所述第二表达式相等,确定所述第一维度和所述第二维度能够进行维度折叠;
对所述多维张量数据的所述第一维度和第二维度执行维度折叠;以及
生成输入/输出指令，所述输入/输出指令用于对维度折叠后所获得的张量数据执行输入/输出操作。
条款B18.一种用于处理多维张量数据的设备,包括:
处理器;以及
存储器,其存储有用于处理多维张量数据的程序指令,当所述程序指令由所述处理器执行时,实现根据条款B1-B17的任意一项所述的方法。
条款B19.一种计算机可读存储介质,其存储有用于处理多维张量数据的程序指令,当所述程序指令由处理器执行时,实现根据条款B1-B17的任意一项所述的方法。
条款B20.一种集成电路装置,包括:
处理器,其用于执行与神经网络模型关联的计算任务;以及
存储器,其存储有对与神经网络模型关联的程序指令进行编译后所获得的二进制程序指令,其中所述神经网络模型经由根据条款B1-B17的任意一项所述的方法进行优化,
其中当所述二进制程序指令由所述处理器运行时,执行与所述神经网络模型关联的所述计算任务。
条款B21.一种板卡,包括根据条款B20所述的集成电路装置。
虽然本文已经示出和描述了本披露的多个实施例,但对于本领域技术人员显而易见的是,这样的实施例只是以示例的方式来提供。本领域技术人员可以在不偏离本披露思想和精神的情况下想到许多更改、改变和替代的方式。应当理解的是在实践本披露的过程中,可以采用对本文所描述的本披露实施例的各种替代方案。所附权利要求书旨在限定本披露的保护范围,并因此覆盖这些权利要求范围内的等同或替代方案。

Claims (26)

  1. 一种编译优化方法,其特征在于,包括:
    获取待处理张量数据的维度信息,其中,所述维度信息包含所述待处理张量数据的至少一个维度的维度值;
    根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息;
    根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行维度折叠。
  2. 根据权利要求1所述的方法,其特征在于,所述维度折叠信息包括可折叠维度;所述根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息,包括:
    根据所述待处理张量数据的维度信息,确定所述待处理张量数据的相邻维度是否连续;
    当所述待处理张量数据的相邻维度连续时,则确定所述相邻维度为可折叠维度。
  3. 根据权利要求2所述的方法,其特征在于,所述待处理张量数据的相邻维度包括第一维度和第二维度,所述方法还包括:
    当所述第一维度方向上的数据之间的跨度等于第二维度方向上数据之间的跨度和第二维度方向上数据的尺寸之积时,则确定所述待处理张量数据的相邻维度连续。
  4. 根据权利要求1或2所述的方法,其特征在于,所述维度折叠信息还包括维度折叠索引;所述根据所述待处理张量数据的维度信息,确定所述待处理张量数据的维度折叠信息,还包括:
    根据所述待处理张量数据中可折叠维度的数量,确定可折叠维度索引。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,若所述待处理张量数据的维度值为常量,则将相应维度值替换为所述常量;所述根据所述待处理张量数据的维度折叠信息,对所述待处理张量数据进行降维处理,包括:
    当所述待处理张量数据的维度值为常量,则根据所述待处理张量数据的维度折叠信息确定折叠后的张量维度,以实现对所述待处理张量数据的降维处理;
    其中,所述折叠后的张量维度等于所述待处理张量数据的可折叠维度的维度值的乘积。
  6. 根据权利要求1-4任一项所述的方法,其特征在于,所述方法还包括:
    若所述待处理张量数据的维度值为变量,则根据所述待处理张量数据的相邻维度相关参数的表达式来确定所述待处理张量数据的相邻维度是否连续,其中,所述待处理张量数据为多维张量数据。
  7. 根据权利要求6所述的方法,其特征在于,所述的根据所述待处理张量数据的相邻维度相关参数的语法树来确定所述待处理张量数据的相邻维度是否连续,包括:
    获取与所述多维张量数据中的第一维度关联的第一表达式;
    获取与所述多维张量数据中的第二维度关联的第二表达式,其中所述第一维度和第二维度是多维张量数据的相邻维度;以及
    判断所述第一表达式与所述第二表达式是否相等，响应于判断所述第一表达式和所述第二表达式相等，确定所述第一维度和所述第二维度能够进行维度折叠。
  8. 根据权利要求7所述的方法,其特征在于,还包括:
    获取根据所述第一表达式构建的包含节点的第一抽象语法树,其中,所述第一抽象语法树包括根节点、叶子节点和子节点;
    获取根据所述第二表达式构建的包含节点的第二抽象语法树,其中,所述第二抽象语法树包括根节点、叶子节点和子节点;以及
    通过判断所述第一抽象语法树和第二抽象语法树是否相同来判断所述第一表达式与所述第二表达式是否相等。
  9. 根据权利要求7或8所述的方法,其特征在于,所述第一表达式基于神经网络模型中的多维张量数据的第一维度的跨度,而所述第二表达式基于所述神经网络模型中的多维张量数据的第二维度的跨度和尺寸。
  10. 根据权利要求8所述的方法,其特征在于,在判断所述第一抽象语法树和第二抽象语法树是否相同前,所述方法还包括:
    对所述第一抽象语法树或第二抽象语法树进行转换,以便转换后的第一抽象语法树和第二抽象语法树适于所述判断。
  11. 根据权利要求10所述的方法,其特征在于,对所述第一抽象语法树或第二抽象语法树进行转换包括:
    将乘法运算对应的根节点或子节点向所述第一抽象语法树或第二抽象语法树的下层方向移动;以及
    将加法运算对应的子节点和/或叶子节点向所述第一抽象语法树或第二抽象语法树的上层方向移动。
  12. 根据权利要求10所述的方法,其特征在于,对所述第一抽象语法树或第二抽象语法树进行转换包括:
    确定所述第一抽象语法树或第二抽象语法树中是否存在运算类型相同的根节点和子节点;以及
    响应于存在运算类型相同的根节点和子节点,将所述根节点和子节点进行合并。
  13. 根据权利要求10所述的方法,其特征在于,对所述第一抽象语法树或第二抽象语法树进行转换还包括对所述第一抽象语法树或第二抽象语法树执行化简操作。
  14. 根据权利要求13所述的方法,其特征在于,所述化简操作包括对第一抽象语法树或第二抽象语法树执行以下之一的化简操作:
    对表示立即数的多个节点执行合并;
    对表示立即数乘以变量和所述变量的多个节点执行合并;以及
    对表示多个相同变量的多个节点进行合并。
  15. 根据权利要求14所述的方法,其特征在于,在执行合并中,所述方法包括:
    对所述第一抽象语法树或第二抽象语法树中的节点所表示的变量执行基于特征的哈希运算;
    对所述哈希运算的运算结果进行排序;以及
    对排序后具有相同运算结果的节点进行合并。
  16. 根据权利要求10所述的方法,其特征在于,对所述第一抽象语法树或第二抽象语法树进行转换还包括:
    消除所述第一抽象语法树或第二抽象语法树中系数为0的分支;和/或
    消除所述第一抽象语法树或第二抽象语法树中仅有一个节点的单分支。
  17. 根据权利要求8-16的任意一项所述的方法,其特征在于,所述第一抽象语法树或第二抽象语法树是由节点构成的分层结构,其中判断所述第一抽象语法树和第二抽象语法树是否相同包括针对于转换后的第一抽象语法树和/或第二抽象语法树:
    逐层地判断所述第一抽象语法树和第二抽象语法树是否相同;或者
    判断所述第一抽象语法树和第二抽象语法树的前预定数目的层是否相同。
  18. 根据权利要求17所述的方法,其特征在于,判断所述第一抽象语法树和第二抽象语法树是否相同包括:
    判断所述第一抽象语法树的第一根节点和所述第二抽象语法树的第二根节点是否具有相同的运算类型;以及
    判断所述第一根节点下的所有子节点和叶子节点与所述第二根节点下的对应所有子节点和叶子节点是否彼此相同或等价。
  19. 根据权利要求18所述的方法,其特征在于,判断所述第一根节点下的所有子节点与所述第二根节点下的所有子节点是否彼此等价包括:
    对所述第一根节点和第二根节点下的所有子节点所表示的变量分别执行基于指纹的哈希运算;
    对所述哈希运算的运算结果进行排序;以及
    将排序后第一根节点和第二根节点下具有相同运算结果的子节点判断为彼此等价。
  20. 根据权利要求1所述的方法,其特征在于,所述待处理张量数据指向操作中的操作数;所述方法还包括:
    根据所述操作中至少一个操作数对应的维度折叠信息,确定目标维度折叠信息;所述目标维度折叠信息用于指示所述至少一个操作数在所述操作中的维度折叠方式。
  21. 根据权利要求20所述的方法,其特征在于,所述目标维度折叠信息包括目标可折叠维度和目标维度折叠索引;所述根据所述操作中至少一个操作数对应的维度折叠信息,确定目标维度折叠信息,包括:
    根据所述至少一个操作数的可折叠维度,确定目标可折叠维度,其中所述目标可折叠维度为所述至少一个操作数的可折叠维度的交集;
    基于所述目标可折叠维度和所述至少一个操作数的维度折叠指引,确定所述目标维度折叠索引。
  22. 根据权利要求20或21所述的方法,其特征在于,所述方法还包括:
    根据所述目标维度折叠信息,对所述操作中的所有操作数进行维度折叠。
  23. 根据权利要求22所述的方法,其特征在于,所述方法还包括:
    在对所述操作中的操作数进行维度折叠之后，编译所述操作，生成所述操作对应的指令，以使目标处理器根据所述指令实现所述操作的功能，获得结果数据。
  24. 根据权利要求23所述的方法,其特征在于,所述方法还包括:
    根据所述目标折叠信息,对所述结果数据进行维度展开,使得所述结果数据的表达形式与所述操作数的初始表达形式一致。
  25. 一种计算机设备,其特征在于,包括:
    处理器;
    用于存储处理器计算机可执行程序的存储器;
    其中,所述处理器被配置为执行所述计算机可执行程序,以实现权利要求1至24中任意一项所述的方法。
  26. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现权利要求1至24中任意一项所述的方法。
PCT/CN2022/116879 2021-09-03 2022-09-02 编译优化方法、装置、计算机设备以及存储介质 WO2023030507A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202111033297.8 2021-09-03
CN202111033876.2A CN115840894A (zh) 2021-09-03 2021-09-03 一种用于处理多维张量数据的方法及其相关产品
CN202111033876.2 2021-09-03
CN202111033297.8A CN115756722A (zh) 2021-09-03 2021-09-03 编译优化方法、装置、计算机设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2023030507A1 true WO2023030507A1 (zh) 2023-03-09

Family

ID=85411977

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116879 WO2023030507A1 (zh) 2021-09-03 2022-09-02 编译优化方法、装置、计算机设备以及存储介质

Country Status (1)

Country Link
WO (1) WO2023030507A1 (zh)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154194A (zh) * 2018-01-18 2018-06-12 北京工业大学 一种用基于张量的卷积网络提取高维特征的方法
US20210117806A1 (en) * 2019-06-27 2021-04-22 Advanced Micro Devices, Inc. Composable neural network kernels
CN112132175A (zh) * 2020-08-14 2020-12-25 深圳云天励飞技术股份有限公司 对象分类方法、装置、电子设备及存储介质
CN114492730A (zh) * 2021-12-23 2022-05-13 北京地平线信息技术有限公司 神经网络模型的编译方法和装置、电子设备和存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108008948A (zh) * 2016-11-30 2018-05-08 上海寒武纪信息科技有限公司 一种指令生成过程的复用装置及方法、处理装置
CN108008948B (zh) * 2016-11-30 2023-08-25 上海寒武纪信息科技有限公司 一种指令生成过程的复用装置及方法、处理装置

Similar Documents

Publication Publication Date Title
WO2021000970A1 (zh) 深度学习算法的编译方法、装置及相关产品
CN112292667B (zh) 选择处理器的方法和装置
WO2021000971A1 (zh) 操作数据的生成方法、装置及相关产品
WO2023071238A1 (zh) 计算图的编译、调度方法及相关产品
WO2023093623A1 (zh) 计算图的优化方法、数据处理方法及相关产品
CN112465133B (zh) 控制流多核并行方法、计算机设备和存储介质
CN112070202B (zh) 一种融合图的生成方法、生成装置和计算机可读存储介质
WO2022253075A1 (zh) 一种编译方法及相关装置
CN111831582B (zh) 用于智能处理器的内存管理装置、方法及电子设备
WO2023030507A1 (zh) 编译优化方法、装置、计算机设备以及存储介质
WO2022134873A1 (zh) 数据处理装置、数据处理方法及相关产品
WO2022078400A1 (zh) 一种对多维数据进行处理的设备、方法和计算机程序产品
CN115840894A (zh) 一种用于处理多维张量数据的方法及其相关产品
CN111667060B (zh) 深度学习算法的编译方法、装置及相关产品
WO2022095676A1 (zh) 神经网络稀疏化的设备、方法及相应产品
CN111831333B (zh) 用于智能处理器的指令分解方法、装置及电子设备
CN116185377A (zh) 计算图的优化方法、计算装置及相关产品
CN116185378A (zh) 计算图的优化方法、数据处理方法及相关产品
CN115329923A (zh) 用于神经网络模型的编译方法和相关产品
WO2021000638A1 (zh) 深度学习算法的编译方法、装置及相关产品
CN115756722A (zh) 编译优化方法、装置、计算机设备以及存储介质
WO2022135599A1 (zh) 融合分支结构的装置、板卡、方法及可读存储介质
WO2022063183A1 (zh) 执行神经网络计算的装置、板卡、方法及可读存储介质
US11809849B1 (en) Global modulo allocation in neural network compilation
CN111831339B (zh) 用于智能处理器的指令执行方法、装置及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863641

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE