CN116185377A - Optimization method and device for calculation graph and related product


Info

Publication number
CN116185377A
Authority
CN
China
Prior art keywords: operator, data, operators, memory, subgraph
Prior art date
Legal status: Pending
Application number
CN202111433244.5A
Other languages
Chinese (zh)
Inventor
Name withheld upon request
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202111433244.5A
Priority to PCT/CN2022/132745
Publication of CN116185377A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/34 Graphical or visual programming
    • G06F8/37 Compiler construction; Parser generation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure discloses a method of optimizing a computational graph, a computing device, a computer-readable storage medium, and a computer program product. The computing device that performs the optimization method may be included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. By optimizing the view-class operator subgraph, the scheme can reduce device-side memory transfers and operator invocations, thereby improving the processing efficiency of the machine.

Description

Optimization method and device for calculation graph and related product
Technical Field
The present disclosure relates generally to the field of intelligent computing, and more particularly to the field of compilation. Still more particularly, the present disclosure relates to a method of optimizing a computational graph, a computing device, a computer-readable storage medium, and a computer program product.
Background
In intelligent computing systems, the programming framework provides the interface through which programmers use the hardware and the system, and it is a critical hub in such systems. On the one hand, the programming framework encapsulates common operations in algorithms, such as convolution and pooling, into operators that programmers can call directly; on the other hand, as the interface between software and hardware, it encapsulates the hardware architecture, reducing the complexity and difficulty of writing and applying deep learning algorithms and improving the efficiency with which algorithms are implemented.
TensorFlow, PyTorch, and the like are currently popular deep learning frameworks. In these programming frameworks, computational graphs are typically used to describe the computation of machine learning algorithms, with tensors representing all the data in the computational graph and operators representing the various operations. Some operators, such as transpose, slice, and split, only change the apparent shape or view of tensor data and not its actual arrangement in memory; that is, they involve no actual movement of memory data. Such operators may be referred to as view-class operators.
Because of this property of view-class operators, tensor data often becomes non-contiguous in memory. Reading non-contiguous data leads to problems such as low memory-access efficiency and high time consumption on hardware devices. In addition, when many view-class operators are present, a large amount of memory-data contiguity processing is required, resulting in significant time overhead.
Disclosure of Invention
To at least partially solve one or more of the technical problems mentioned in the background, the present disclosure provides solutions in several aspects. In one aspect, a method of optimizing a computational graph is provided, in which a view-class operator subgraph is constructed to support subsequent memory-data contiguity processing. In another aspect, a further optimization method is provided which, based on the pre-constructed view-class operator subgraph, can fuse operators according to the interrelationship of the view-class operators, reducing device-side memory transfers and operator invocations and thereby improving memory-access efficiency.
In a first aspect, the present disclosure discloses a method of optimizing a computational graph, comprising: obtaining a view-class operator subgraph of tensor data in the computational graph, wherein the view-class operator subgraph comprises view-class source operators associated with the tensor data; replacing each source operator in the view-class operator subgraph, according to its function, with a designated target operator that can functionally substitute for it; and fusing a plurality of consecutive identical target operators into a single target operator to generate a fused view-class operator subgraph.
In a second aspect, the present disclosure discloses a computing device for optimizing a computational graph, comprising: a processor configured to execute program instructions; and a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform a method of optimizing a computational graph according to the first aspect of the present disclosure.
In a third aspect, the present disclosure discloses a computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform a method of optimizing a computational graph according to the first aspect of the present disclosure.
In a fourth aspect, the present disclosure discloses a computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of optimizing the computational graph of the first aspect of the present disclosure.
With the optimization method provided by embodiments of the present disclosure, operator subgraphs built in advance from the view-class operators in the computational graph can be optimized and operators of the same type fused, so that memory data transfers and operator invocations are reduced and memory-access efficiency is improved.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates different shapes of a multi-dimensional array and its order of storage on memory;
FIG. 2 illustrates an exemplary flow chart of a method of optimizing a computational graph according to embodiments of the present disclosure;
FIG. 3 illustrates an exemplary flow chart of a method of optimizing a computational graph according to another embodiment of the present disclosure;
FIGS. 4 a-4 c illustrate the structure of several exemplary computational graphs and the structure of a correspondingly constructed view class operator subgraph;
FIGS. 5 a-5 b illustrate a simple example of operator fusion;
FIG. 6 illustrates an exemplary method flow diagram for operator fusion according to some embodiments of the present disclosure;
FIG. 7 illustrates an exemplary flow chart of a data processing method according to an embodiment of the disclosure;
FIG. 8 illustrates a block diagram of a hardware configuration of a computing device in which various aspects of embodiments of the present disclosure may be implemented;
FIG. 9 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure; and
FIG. 10 illustrates a schematic structural diagram of a board according to an embodiment of the present disclosure.
Detailed Description
The following describes the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings. It is evident that the embodiments described are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," and the like, as may appear in the claims, specification and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
Embodiments of the present disclosure are described in detail below with reference to the attached drawings.
In the programming frameworks of intelligent computing systems, data is typically modeled as tensors. A tensor can be regarded as an N-dimensional array, where the number of dimensions of the array is the order of the tensor. Thus, a 0-order tensor corresponds to scalar data; a 1-order tensor corresponds to a one-dimensional array, i.e., a vector; a 2-order tensor corresponds to a two-dimensional array, i.e., a matrix; and, by analogy, an N-order tensor corresponds to an N-dimensional array. For example, an RGB image can be represented as a 3-order tensor, and a dataset of multiple RGB images can be represented as a 4-order tensor.
Each tensor has some common attributes, including data type, shape, and so on. The shape of a tensor gives the length of each of its orders. For example, a 0-order tensor corresponds to scalar data, and its shape is empty; a 1-order tensor corresponds to a one-dimensional vector, and its shape contains one element whose value is the length of the vector; a 2-order tensor corresponds to a matrix, and its shape contains two elements, the lengths of its rows and columns; a 3-order tensor corresponds to three-dimensional data, and its shape contains three elements, one for the length of each order.
Although the multi-dimensional array has a plurality of dimensions, there is a correspondence between the multi-dimensional array and the storage order on the memory because the layout of the memory (e.g., memory DRAM and cache RAM) is always one-dimensional. The multi-dimensional arrays are typically allocated in contiguous memory space, i.e., the multi-dimensional arrays can be one-dimensionally expanded and stored in sequence on memory.
FIG. 1 illustrates different shapes of multi-dimensional arrays and their order of storage on memory, where storage of the multi-dimensional arrays is accomplished using a one-dimensional array of a contiguous memory.
Fig. 1(a) illustrates first data, i.e., a three-dimensional array X with three dimensions: dimension 0 (dim0), dimension 1 (dim1), and dimension 2 (dim2). The size of dimension 0 is 2, the size of dimension 1 is 2, and the size of dimension 2 is 3. Its shape (size) can be expressed as X_3 = (2, 2, 3).
Fig. 1(c) illustrates the storage order of the three-dimensional array X in memory, where data with the same background belong to the same dimension. Assuming the first data is stored in a low-dimension-first manner (i.e., the shape is written from the highest dimension on the left to the lowest dimension on the right), its one-dimensional expansion is:
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12].
More specifically, data in the lowest dimension (the same row) are contiguous, while data in higher dimensions are separated by different distances. For example, in the storage layout shown in (c), accessing adjacent elements along dimension dim2 requires an offset of 1 position (e.g., from data 1 to data 2, from data 5 to data 6, etc.); accessing adjacent elements along dimension dim1 requires an offset of 3 positions (e.g., from data 1 to data 4, from data 2 to data 5, ..., from data 9 to data 12, etc.); and accessing adjacent elements along dimension dim0 requires an offset of 6 positions (e.g., from data 1 to data 7, from data 2 to data 8, ..., from data 6 to data 12, etc.). This offset is called the stride. The stride of each dimension of the three-dimensional array X can thus be expressed as S_X = (6, 3, 1).
In the programming frameworks of intelligent computing systems, there are view-class operators that operate on the outward appearance of tensors, such as transpose, slice, split, and the like. Taking transpose as an example, it produces a dimension-permuted data arrangement according to a dimension conversion rule perm_N = (p_1, p_2, ..., p_i, ..., p_N), where the value of p_i (i in 1, 2, ..., N) denotes an original dimension of the array and the position of p_i in perm_N denotes the target dimension after conversion. For example, the dimension conversion rule perm_3 = (0, 2, 1) indicates that dimension 1 is to be swapped with dimension 2, i.e., original dimension 1 becomes dimension 2 of the new array and original dimension 2 becomes dimension 1 of the new array.
Fig. 1(b) illustrates the converted array Y obtained by applying the transpose operator to the three-dimensional array X of Fig. 1(a). In this example, the exemplary dimension conversion rule perm_3 = (0, 2, 1) described above is applied. As can be seen from the figure, dimension 1 and dimension 2 of array Y are swapped compared with array X. The shape of the three-dimensional array Y can now be expressed as Y_3 = (2, 3, 2).
However, since a view-class operator does not change the storage position of the data in memory, the storage order of array Y after the transpose operation is still as shown in Fig. 1(c). According to that storage order, the stride of each dimension of array Y becomes S_Y = (6, 1, 3). Data stored sequentially according to the low-dimension-first principle are said to be contiguous; by this criterion, the current storage order of array Y is non-contiguous. In other words, after the transpose operator, the array becomes non-contiguous in memory because its dimension order has changed while its storage positions have not.
If array Y is to be contiguous in memory, its one-dimensional expansion should, according to the low-dimension-first principle, be as shown in Fig. 1(d):
Y = [1, 4, 2, 5, 3, 6, 7, 10, 8, 11, 9, 12].
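For intuition, the following minimal PyTorch sketch (illustrative only, not part of the disclosed method) reproduces the example of Fig. 1: a permute changes the shape and stride of the tensor but not the underlying storage, and contiguous() performs the element-by-element rearrangement.

    # Minimal PyTorch sketch of the Fig. 1 example (illustrative only).
    import torch

    x = torch.arange(1, 13).reshape(2, 2, 3)  # X_3 = (2, 2, 3)
    print(x.stride())                         # (6, 3, 1)  -> S_X
    print(x.is_contiguous())                  # True

    y = x.permute(0, 2, 1)                    # perm_3 = (0, 2, 1): swap dim1 and dim2
    print(y.shape)                            # (2, 3, 2)  -> Y_3
    print(y.stride())                         # (6, 1, 3)  -> S_Y: storage unchanged
    print(y.is_contiguous())                  # False

    y_c = y.contiguous()                      # element-by-element copy into contiguous storage
    print(y_c.flatten().tolist())             # [1, 4, 2, 5, 3, 6, 7, 10, 8, 11, 9, 12]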
The shape of a tensor helps the programmer form an intuitive picture of it. In a programming framework such as PyTorch, a view-class operator may change attributes such as the shape (size), the stride (the span, in elements, between adjacent indices of a dimension), and the storage offset (the offset of the tensor's first element relative to the start of storage), but it does not change the true storage position of the tensor. In this case, the memory location of a data element on the device side is computed from size, stride, and storage_offset.
Assume the tensor's size is (s_0, s_1, s_2, ..., s_i), its stride is (y_0, y_1, y_2, ..., y_i), and its storage_offset is b. The basic formula for the memory location of the element at point (x_0, x_1, x_2, ..., x_i) is then:

    address = dptr + (b + Σ_k x_k · y_k) × sizeof(dtype),

where dptr is the start position in memory of the storage corresponding to the tensor, and dtype is the data type of the tensor.
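As a hedged illustration of this formula (the helper below and its names are assumptions made for this description, not part of the patent), the address computation can be written in Python as:

    # Sketch of the address formula: byte address of element (x_0, ..., x_i)
    # of a strided tensor, given dptr, storage_offset b and stride (y_0, ..., y_i).
    def element_address(dptr, storage_offset, stride, index, itemsize):
        linear = storage_offset + sum(x * y for x, y in zip(index, stride))
        return dptr + linear * itemsize

    # Example: array Y of Fig. 1 with stride S_Y = (6, 1, 3) and 4-byte elements;
    # element (1, 2, 1) lies 11 elements past the start of storage.
    addr = element_address(dptr=0x1000, storage_offset=0,
                           stride=(6, 1, 3), index=(1, 2, 1), itemsize=4)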
As can be seen from the foregoing description in connection with Fig. 1, after tensors in the computational graph are processed by view-class operators, the corresponding data become scattered in memory. When accessing such tensors, conventional CPUs and GPUs must perform non-contiguous data accesses and reads according to the above formula, which causes problems such as low memory-access efficiency and high latency on the hardware device. An alternative is to call the contiguous() operator, which first moves the data element by element into contiguous storage according to the above formula and then performs the subsequent accesses and operations. However, this approach is also time consuming; when the amount of data is large, moving data element by element becomes extremely expensive.
In view of this, and considering that in the computation of a computational graph such as a neural network the non-contiguity of memory data is usually caused by view-class operators, the present disclosure proposes a scheme that constructs a view-class operator subgraph for the view-class operators in the computational graph; this subgraph can then support an efficient subsequent conversion of the memory data from non-contiguous to contiguous.
Regarding the terms "node" and "operator" used in this disclosure, it should be noted that the term "operator" speaks from the computational level of a computer (or from the software or algorithm level), while the term "node" is a more visual expression (from the graph level, or a more intuitive level). The two terms refer to the same thing; in this disclosure they may be considered to have the same meaning and may be used interchangeably, merely describing the same object from different perspectives.
FIG. 2 illustrates an exemplary flow chart of a method of optimizing a computational graph according to embodiments of the present disclosure. In the optimization method, a view class operator subgraph is constructed to support subsequent memory data continuity processing.
As shown, in step 210, for tensor data in a computational graph, operators associated with the tensor data are traversed.
A computational graph is a directed graph that includes nodes and edges, with tensors passed between the nodes. The graph is executed in the order of the directed graph: each tensor reaching a node serves as an input to that node's computation, and the computed result flows along the node's output edges to the following nodes. Thus, when building a view-class operator subgraph, the nodes (operators) that process a given tensor can be traversed in the order of the directed graph.
Next, in step 220, when the operator encountered by the traversal is a view-class operator, the operator is extracted to construct a view-class operator subgraph.
In some embodiments, extracting the view-class operator to construct a view-class operator subgraph may include: caching the operator information of the operator in association with an operator sequence number; and adding the operator sequence number to the view-class operator subgraph. In these embodiments, storing the operator information separately from the view-class operator subgraph and linking the two through operator sequence numbers simplifies the structure of the subgraph, which facilitates the subsequent memory-data contiguity processing.
Each operator has attributes that identify information relevant to performing its operation. Common attributes include the operator name, operator type, input data, output data, operation parameters, and so on. In some embodiments, the cached operator information may include at least one of: description information of the operator's input data, description information of its output data, and its operation parameters. It will be appreciated that the input and output data of an operator are tensor data, and the description information of tensor data mainly comprises the aforementioned shape, stride, storage offset, and the like.
The operation parameters of an operator are associated with the function it implements. For example, for the transpose operator, the operation parameters may include the two dimensions to be swapped (dim0, dim1). As another example, the chunk (splitting) operator splits a tensor evenly along a dimension dim, and its operation parameters may include the number of chunks and the dimension dim along which to split.
The above describes how the view-class operators that cause memory-data non-contiguity are extracted to construct view-class operator subgraphs supporting subsequent memory-data contiguity processing. It will be appreciated that a view-class operator subgraph can be constructed for each tensor. Furthermore, depending on how the view-class operators are distributed in the computational graph, each tensor may be associated with multiple segments of view-class operator subgraphs.
FIG. 3 illustrates an exemplary flow chart of a method of optimizing a computational graph according to another embodiment of the present disclosure. In this embodiment, the construction of view class operator subgraphs may be further optimized to simplify the storage of information.
As shown, when operators are extracted to construct the view-class operator subgraph, for each view-class operator encountered it is first checked, in step 310, whether the operator information of that operator has already been cached in memory. The operator information may include, for example, the description information of the input data, the description information of the output data, and the operation parameters mentioned above.
If the operator information is not cached, indicating that the operator is a new operator with respect to the operators in memory, flow proceeds to step 320 where an operator sequence number is generated for the operator and the operator information and operator sequence number are cached in association as described above. Further, in step 330, the operator sequence number is added to the view class operator subgraph.
If the operator information is cached, the same information does not need to be cached repeatedly. Instead, flow proceeds directly to step 330 where only the operator sequence number of the cached operator is added to the view-type operator subgraph.
This processing effectively reduces the amount of cached information and simplifies the construction of view-class operator subgraphs. A sketch of this construction procedure follows.
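The following Python sketch summarizes steps 210-330 under assumed data structures (names such as ViewOpInfo, the VIEW_OPS set, and the op fields are illustrative assumptions, not identifiers from the patent):

    # Hypothetical sketch of subgraph construction with operator-info caching.
    from dataclasses import dataclass

    VIEW_OPS = {"transpose", "permute", "view", "select", "chunk",
                "narrow", "slice", "split", "expand"}

    @dataclass(frozen=True)
    class ViewOpInfo:
        op_type: str
        input_desc: tuple    # shape / stride / storage offset of the input tensor
        output_desc: tuple   # shape / stride / storage offset of the output tensor
        params: tuple        # operation parameters, e.g. dims to swap, chunks, ...

    def build_view_subgraph(ops_for_tensor):
        """Traverse the operators applied to one tensor (step 210) and extract the
        view-class ones into a subgraph of operator sequence numbers (steps 220-330)."""
        info_cache = {}      # ViewOpInfo -> operator sequence number
        subgraph = []        # e.g. [1, 2, 2] when two operators share the same cached info
        for op in ops_for_tensor:
            if op.op_type not in VIEW_OPS:
                continue                                 # computation-class operators are skipped
            info = ViewOpInfo(op.op_type, op.input_desc, op.output_desc, op.params)
            if info not in info_cache:                   # steps 310/320: cache new operator info
                info_cache[info] = len(info_cache) + 1
            subgraph.append(info_cache[info])            # step 330: add the sequence number
        return subgraph, info_cache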
Figs. 4a-4c illustrate the structures of several exemplary computational graphs and the correspondingly constructed view-class operator subgraphs.
Fig. 4a shows a computational graph with a single-path structure, in which the input tensor A 410 passes sequentially, following the flow of the computational graph, through the following nodes: a transpose operator 411, a slice operator 412, a slice operator 413, and a Matmul (matrix multiplication) operator 414. Among these, the transpose operator 411, the slice operator 412, and the slice operator 413 are view-class operators, while the Matmul operator 414 is a computation-class operator.
According to the view-class operator subgraph construction scheme of embodiments of the present disclosure, the view-class operators are extracted to form a view-class operator subgraph. As shown on the right of Fig. 4a, for the input tensor A the view-class operator subgraph contains, in order, the transpose operator 411, the slice operator 412, and the slice operator 413.
In some embodiments, assuming the operator sequence number generated for the transpose operator 411 is 1, that of the slice operator 412 is 2, and that of the slice operator 413 is 3, the constructed view-class operator subgraph can be represented by these sequence numbers as 1->2->3. The operator information of each operator can be retrieved from the cache through its sequence number.
In other embodiments, if the operator information of the slice operator 412 and that of the slice operator 413 are identical, the two may share a single cached entry and the same operator sequence number. In such an embodiment, when the slice operator 413 is processed, its operator information is found to be identical to the information already cached for the preceding slice operator 412, so the caching step is skipped and the cached sequence number 2 of the slice operator 412 is directly assigned to the slice operator 413 and added to the view-class operator subgraph. The constructed view-class operator subgraph is then represented by sequence numbers as 1->2->2.
Fig. 4b shows a computational graph with a residual structure, in which the input tensor B 420 passes, following the flow of the computational graph, through the following nodes: a view operator 421, a Conv operator 422, an Act operator 423, and an Add operator 424, where the output of the view operator 421 is also fed to the Add operator 424 as its other addend. Among these operators, only the view operator 421 is a view-class operator; the rest are computation-class operators.
According to the view-class operator subgraph construction scheme of embodiments of the present disclosure, the view-class operators in the computational graph are extracted to form a view-class operator subgraph. As shown on the right of Fig. 4b, for the input tensor B the view-class operator subgraph contains only the view operator 421.
Fig. 4c shows a computational graph with a multi-branch structure, in which the input tensor C 430 passes, following the flow of the computational graph, through the following nodes: a split operator 431; transpose1, transpose2, and transpose3 operators 432 to 434 located on three branches; a BMM1 operator 435 operating on the outputs of the first and second branches; a Softmax operator 436; and a BMM2 operator 437 operating on the result of the first two branches and the output of the third branch. Among these operators, the split operator 431 and the three transpose operators 432 to 434 are view-class operators; the rest are computation-class operators.
According to the view-class operator subgraph construction scheme of embodiments of the present disclosure, the view-class operators in the computational graph are extracted to form a view-class operator subgraph. As shown on the right of Fig. 4c, for the input tensor C the view-class operator subgraph may be divided into three branches according to the operation parameters of the split operator 431 (for example, the number of data blocks it produces), each branch containing the split operator 431 and the corresponding one of the transpose operators 432 to 434. It can be seen that when a view-class operator has multiple branches, a view-class operator subgraph with a corresponding number of branches can be constructed based on that operator.
The above describes, with several examples, the view-class operator subgraph construction scheme provided by embodiments of the present disclosure. As can be seen, the subgraphs constructed so far merely extract the consecutive view-class operators without any further processing. When many view-class operators are present, performing memory-data contiguity processing operator by operator leads to frequent operator invocations and data transfers, which causes repeated memory accesses, low access efficiency, and increased network latency.
One typical optimization for a computational graph is operator fusion, i.e., multiple operators are computed together in a single kernel, without saving intermediate results back to global memory.
For a better understanding of operator fusion, Figs. 5a-5b show a simple example.
Assume the graph contains two operators that execute in sequence: a first operator and a second operator, denoted (1) and (2) below. Fig. 5a shows the operation flow without operator fusion, which proceeds as follows:
1) The input of the whole computational graph (i.e., the input of (1)) is read from DRAM (dynamic random access memory) into on-chip memory, e.g., PNM (parallel neuron memory), and the weights of (1) are read into on-chip memory, e.g., PWM (parallel weight memory);
2) The PFU (parallel functional unit) fetches operands from PNM and PWM, completes the operation, and writes the result of (1) back to PNM;
3) The result of (1) is written back from PNM to DRAM as the input of (2).
Then the second operator (2) is executed:
4) The input of (2) is read from DRAM into PNM, and the weights of (2) into PWM;
5) The PFU fetches operands from PNM and PWM, completes the operation, and writes the result of (2) back to PNM;
6) The result of (2) is written back to DRAM as the output of the entire computational graph.
Fig. 5b shows the operation flow after operator fusion, which proceeds as follows:
a) The input of the whole computational graph (i.e., the input of (1)) is read from DRAM into PNM, and the weights of (1) and (2) into PWM;
b) The PFU fetches operands from PNM and PWM, completes the operation, and writes the result of (1) back to PNM;
c) The PFU fetches operands from PNM and PWM, completes the operation, and writes the result of (2) back to PNM;
d) The result of (2) is written back to DRAM as the output of the entire computational graph.
Comparing the two flows, operator fusion eliminates steps 3) and 4) of the unfused flow, i.e., it removes the redundant movement of the same block of data (in this example the result of (1), which is the input of (2)) from PNM to DRAM and back from DRAM to PNM. In other words, the memory accesses for the intermediate result are eliminated, which increases the operation speed.
In a specific implementation, the fused operator can exploit compilation optimizations such as memory reuse, memory-access optimization, instruction pipelining, and data-type optimization (e.g., choosing among the applicable data types) at compile time, so that the overall performance of the fused operator is significantly improved.
In view of this, in the embodiments of the present disclosure, a scheme is provided for performing operator fusion on view-type operator subgraphs constructed by the foregoing method to optimize the operator subgraphs, thereby optimizing subsequent memory data continuity processing.
FIG. 6 illustrates an exemplary method flow diagram for operator fusion according to some embodiments of the present disclosure. In this embodiment, operator fusion strategy selection is performed by scanning pre-built view-type operator subgraphs.
As shown, in step 610, a view class operator subgraph is obtained that calculates tensor data in a graph, wherein the view class operator subgraph includes source operators of a view class associated with the tensor data.
The View class operator subgraph is constructed according to the method described above, as in several examples of fig. 4 a-4 c. It can be seen that the operators in the operator subgraph before optimization are the original view class operators in the computation graph, referred to herein as source operators, to distinguish them from the optimized operators.
Next, in step 620, each source operator in the view-class operator subgraph is replaced, according to its function, with a designated target operator that can functionally substitute for it.
In programming frameworks such as PyTorch there is a wide variety of view-class operators implementing different functions, including, for example but not limited to: transpose, permute, select, chunk, narrow, slice, expand, view, and so on.
Although the specific functions implemented by these operators vary, they can be categorized. In some embodiments, they can be divided, according to their effect on the data size, into scale-invariant, scale-reducing, and scale-expanding functions. Among the operators listed above, transpose, permute, view, and the like do not change the size of the tensor data and belong to the scale-invariant class; select, chunk, narrow, slice, and the like reduce the size of the tensor data and belong to the scale-reducing class; and expand and the like enlarge the size of the tensor data and belong to the scale-expanding class.
For each function class, one operator can be selected to represent that class. This operator can realize the functions of all operators in the corresponding class, i.e., it can functionally replace them. Herein, the operator after replacement is referred to as the "target operator", and the operator before replacement as the "source operator". Table 1 below gives, by way of example, several function classes together with the source operators and target operator of each class. It should be understood that the operators here are merely exemplary and not exhaustive, and that those skilled in the art can construct similar functional classifications and functionally substitutable target operators in light of the principles of the disclosed embodiments.
No.  Function class     Source operator names          Target operator name
1    Scale-invariant    transpose, permute, view       permute
2    Scale-reducing     select, chunk, narrow, slice   slice
3    Scale-expanding    expand                         expand

TABLE 1
It can be seen that the functions implemented by the source operators in each class are a subset of those of the corresponding target operator. By classifying the source operators by function and replacing them with the designated target operators, the number of operator types in the operator subgraph is reduced, which facilitates the subsequent fusion operation; a sketch of this replacement step is given below.
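A hedged Python sketch of this replacement step follows; the mapping simply mirrors Table 1 (plus split, mentioned as a scale-reducing operator in the example below), and the data layout is an assumption for illustration only:

    # Sketch of step 620: replace each source operator by its designated target
    # operator, following Table 1 (assumed data structures, illustration only).
    SOURCE_TO_TARGET = {
        # scale-invariant
        "transpose": "permute", "permute": "permute", "view": "permute",
        # scale-reducing
        "select": "slice", "chunk": "slice", "narrow": "slice",
        "slice": "slice", "split": "slice",
        # scale-expanding
        "expand": "expand",
    }

    def replace_with_target_ops(subgraph_ops):
        """subgraph_ops: list of (op_type, params); converting the parameters to
        the target operator's form is elided here."""
        return [(SOURCE_TO_TARGET[op_type], params) for op_type, params in subgraph_ops]

    # e.g. [("chunk", p1), ("split", p2)] becomes [("slice", p1), ("slice", p2)],
    # which the next step can fuse into a single slice operator.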
Continuing with FIG. 6, finally, in step 630, a plurality of consecutive identical target operators in the replaced operator subgraph are fused into a single target operator to generate a fused view-type operator subgraph.
Through the replacement in the previous step, functionally similar view-class operators are replaced with the same target operator. When several identical target operators are adjacent in the subgraph, they can be fused into a single target operator, which reduces the number of operators and hence the number of operator invocations needed later.
In some embodiments, fusing consecutive identical multiple target operators into a single target operator may include: the dimension operations of the plurality of target operators are combined so that the single target operator after the combination is equivalent to the plurality of target operators before the combination.
It will be appreciated that normally the dimension operations of a plurality of successive target operators are performed sequentially, each target operator in turn performing a dimension operation on tensor data it inputs. Since the target operators are contiguous and identical, these dimensional operations can be merged, thereby achieving the effect of multiple dimensional operations with a single target operator.
For example, assume the view-class operator subgraph contains two consecutive operators: a chunk operator and a split operator. According to the functional classification, both the chunk operator and the split operator belong to the scale-reducing class and are therefore replaced by slice operators. According to embodiments of the present disclosure, the two slice operators can then be fused into a single slice operator, and their dimension operations likewise have to be merged into that single slice operator.
The first slice operator corresponds to the original chunk operator, and it is assumed that the dimension operation implemented by the first slice operator is to segment the dim0 dimension of the input tensor data D into 2 blocks. Executing the chunk operator will split the dim0 dimension of tensor data D into 2 blocks as evenly as possible.
The second slice operator corresponds to the original split operator, and its dimension operation is assumed to split the dim1 dimension of the input tensor data D into blocks whose size is 4 as far as possible. Executing the split operator thus divides the dim1 dimension of tensor data D into blocks of size 4 where possible.
When the two slice operators are combined into one slice operator, the dimension operation to be realized is to divide the dim0 dimension of the input tensor data D into 2 blocks, and divide the dim1 dimension into blocks with the size of 4 as much as possible. The operation can be realized by configuring the operation parameters of the slice operator.
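A minimal Python sketch of this merge follows, under the assumption that each slice operator records a per-dimension splitting spec; this representation is illustrative and not the patent's data structure.

    # Sketch of step 630 for two consecutive slice operators acting on different
    # dimensions: merging combines their per-dimension operations.
    def merge_slice_ops(first_params, second_params):
        merged = dict(first_params)
        merged.update(second_params)   # specs on different dimensions are independent
        return merged

    p1 = {0: {"chunks": 2}}        # first slice (originally chunk): split dim0 into 2 blocks
    p2 = {1: {"block_size": 4}}    # second slice (originally split): dim1 blocks of size 4

    fused = merge_slice_ops(p1, p2)
    # fused == {0: {"chunks": 2}, 1: {"block_size": 4}}: one slice operator that
    # splits dim0 into 2 blocks and dim1 into blocks of size 4.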
Alternatively or additionally, in some embodiments, the location of a particular type of target operator in the view class operator subgraph may also be adjusted to optimize processing.
In one example, the position of a scale-expanding operator (e.g., the expand operator), which increases the amount of memory data, may be moved later in the subgraph. Postponing it avoids inflating the memory data early on, which would otherwise increase the amount of data the subsequent IO operators have to move. Preferably, the expand-class operator is moved as close to the end of the processing as possible.
When the expand operator is moved back, the relative order of the operators before and after it changes, and the parameters of the target operators that the expand operator is moved past must be modified to accommodate the new ordering.
For example, assume the operator subgraph contains, in order, an expand operator, a permute operator, and a slice operator (all already replaced with target operators). The dimension operation of the expand operator enlarges the shape of tensor data E (for example, from size1 = (1, 3), a matrix of 1 row and 3 columns) to a new shape (for example, size2 = (2, 3), a matrix of 2 rows and 3 columns), obtaining tensor data E' by copying E. The dimension operation of the permute operator swaps the two dimensions of the expanded tensor data E', obtaining tensor data E''. The dimension operation of the slice operator splits tensor data E'' into blocks of size 2×2 as far as possible and takes the first block.
According to embodiments of the present disclosure, the expand operator can be moved to the last position, which requires modifying the parameters of both the permute and the slice operator. By the above analysis, the expand operator only increases the size of one dimension of tensor data E (here dim0) without adding dimensions. The parameters of the permute operator can therefore remain unchanged, e.g., still (1, 0), indicating that dim0 and dim1 are swapped. Since the expand operator changes the size of dim0, the parameters of the slice operator need to be adjusted: the original parameters are kept for the unchanged dimension, while the parameter for the changed dimension is reduced accordingly, e.g., to 1/2 (according to the expansion factor of expand). That is, the dimension operation of the slice operator is modified to split the tensor output by the permute operator into blocks of size 2×1 as far as possible and take the first block. Correspondingly, the parameters of the expand operator, now placed last, are also adjusted, e.g., the expanded shape becomes size3 = (2, 2), so that the adjusted sequence of dimension operations remains equivalent to the sequence before adjustment. A small sketch checking this equivalence follows.
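The following PyTorch snippet is a hedged sketch (not taken from the patent) that verifies the equivalence on the concrete example above:

    # Sketch: moving expand to the end while adjusting the slice/expand parameters
    # keeps the result unchanged (PyTorch is used only to check the equivalence).
    import torch

    E = torch.tensor([[1., 2., 3.]])                 # size1 = (1, 3)

    # Original order: expand -> permute -> slice (2x2 block, take the first one).
    out_before = E.expand(2, 3).permute(1, 0)[:2, :2]

    # Adjusted order: permute -> slice (block reduced to 2x1) -> expand to (2, 2).
    out_after = E.permute(1, 0)[:2, :1].expand(2, 2)

    assert torch.equal(out_before, out_after)        # both are [[1., 1.], [2., 2.]]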
After the above processing, the view class operator subgraph after the fusion processing can be returned.
As mentioned above, when tensor data becomes non-contiguous in memory through view-class operators, conventional CPUs and GPUs must perform non-contiguous data accesses and reads according to the formula described above, which leads to low memory-access efficiency, high time consumption, and similar problems on the hardware device. In neural network computation libraries, most operators require the input tensor to be contiguous in memory and otherwise produce errors. In that case an operator such as contiguous() must be called, which also moves the data element by element into contiguous storage according to the above formula. This element-by-element data movement is very time consuming and adds significant overhead to the computation of the whole computational graph.
In embodiments of the present disclosure, after the view-class operator subgraphs have been constructed, fused, and optimized, whenever an operator that requires its input tensor to be contiguous in memory (such as a computation-class operator) is later encountered, memory-data contiguity processing can be performed based on the pre-constructed view-class operator subgraph, calling the corresponding kernels to carry out the data movement. This reduces the time spent moving data and improves computation efficiency.
Fig. 7 illustrates an exemplary flow chart of a data processing method according to some embodiments of the present disclosure.
As shown, in step 710, in response to the tensor data to be processed being non-contiguous in memory, a view class operator subgraph of the tensor data is obtained. The view-class operator subgraphs of tensor data are constructed and optimized, for example, according to the methods described above.
In some embodiments, an is_contiguous function may be used to determine whether the tensor data is contiguous in memory. If the tensor data is contiguous, no additional processing is needed. If it is non-contiguous, the view-class operator subgraph associated with the tensor data may be obtained.
It will be appreciated that if there is no view-class operator subgraph associated with the tensor data, the tensor data can only be made contiguous in the existing manner, e.g., by calling the contiguous function and moving the data element by element.
Next, in step 720, according to the information in the obtained view-class operator subgraph, the corresponding kernels are invoked to perform the data movement, converting the tensor data into tensor data that is contiguous in memory.
Specifically, to avoid the time overhead of element-by-element data movement, the operator types in the view-class operator subgraph can be analyzed and kernels matching those operator types invoked to move the data; each kernel moves data in blocks according to its operator type.
As mentioned in the operator fusion process above, essentially only three kinds of view-class operators can remain in the fused view-class operator subgraph: permute, slice, and expand. For each view-class operator, an appropriate kernel that implements the function of that operator can be selected from a high-performance computing library to perform the corresponding data movement. For example, for the permute operator, a transpose kernel in a high-performance computing library (e.g., CNNL) can be invoked to implement the data rearrangement; for the expand operator, an expand kernel in CNNL can be invoked to implement the data expansion.
Thus, by traversing the view-class operators in subgraph order and invoking a kernel for each of them, the tensor data can be converted from a non-contiguous memory state to a contiguous one.
Compared with the earlier element-by-element data movement, invoking kernels to move data block by block greatly shortens the processing time and improves memory-access efficiency. A sketch of this flow is given below.
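A hedged Python sketch of the flow of Fig. 7 follows; the function names and the kernel dispatch table are assumptions for illustration (the block-wise kernels stand in for library kernels such as the CNNL transpose and expand kernels named above), and the tensor is assumed to offer PyTorch-style is_contiguous()/contiguous() methods.

    # Sketch of steps 710-720: make a tensor contiguous by replaying its fused
    # view-class operator subgraph with block-wise kernels (illustrative only).
    def make_contiguous(tensor, view_subgraph, kernels):
        """kernels maps a target-operator type ("permute", "slice", "expand") to a
        callable performing the corresponding block-wise data movement."""
        if tensor.is_contiguous():
            return tensor                       # step 710: already contiguous
        if not view_subgraph:
            return tensor.contiguous()          # fallback: element-by-element copy
        for op_type, params in view_subgraph:   # step 720: traverse in subgraph order
            tensor = kernels[op_type](tensor, params)
        return tensor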
The method for constructing the view type operator subgraph, the method for optimizing the operator fusion and the method for processing the continuity of the memory data based on the view type operator subgraph according to the embodiment of the disclosure are described above with reference to the drawings. The present disclosure also provides a computing device that may be used to construct a view-type operator subgraph, optimize an operator subgraph, or perform memory data continuity processing.
Fig. 8 illustrates a block diagram of a hardware configuration of a computing device 800 in which various aspects of embodiments of the disclosure may be implemented. As shown, computing device 800 may include a processor 810 and a memory 820. In the computing apparatus 800 of fig. 8, only constituent elements related to the present embodiment are shown. Thus, it will be apparent to those of ordinary skill in the art that: computing device 800 may also include common constituent elements that differ from those shown in fig. 8, such as: a display.
The computing apparatus 800 may correspond to a computing device having various processing functions, for example, functions for compiling a computation graph. For example, the computing apparatus 800 may be implemented as various types of devices, such as a Personal Computer (PC), a server device, a mobile device, and so forth.
A processor 810 configured to execute program instructions to control all functions of the computing device 800. For example, the processor 810 controls all functions of the computing device 800 by executing programs stored in the memory 820 on the computing device 800. The processor 810 may be implemented by a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an Application Processor (AP), an artificial intelligence processor chip (IPU), etc., provided in the computing device 800. However, the present disclosure is not limited thereto.
Memory 820 is hardware for storing various data processed in the computing device 800. For example, the memory 820 may store processed data and data to be processed in the computing device 800, such as data that has been processed or is to be processed by the processor 810, e.g., a computational graph before compilation, a computational graph after compilation, and the like. In addition, the memory 820 may store program instructions of applications, drivers, etc. to be run by the computing device 800, for example, various programs related to the optimization algorithm of a computational graph to be executed by the processor 810. The memory 820 may be a DRAM, but the present disclosure is not limited thereto. The memory 820 may include at least one of volatile memory or nonvolatile memory. The nonvolatile memory may include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), flash memory, Phase-change RAM (PRAM), Magnetic RAM (MRAM), Resistive RAM (RRAM), Ferroelectric RAM (FRAM), and the like. Volatile memory can include Dynamic RAM (DRAM), Static RAM (SRAM), Synchronous DRAM (SDRAM), PRAM, MRAM, RRAM, Ferroelectric RAM (FeRAM), and the like. In an embodiment, the memory 820 may include at least one of a Hard Disk Drive (HDD), a Solid State Drive (SSD), a CompactFlash (CF) card, a Secure Digital (SD) card, a Micro-SD card, a Mini-SD card, an eXtreme Digital (xD) card, a cache, or a memory stick.
In summary, the specific functions implemented by the memory 820 and the processor 810 of the computing device 800 provided in the embodiments of the present disclosure may be explained in comparison with the previous embodiments in the present disclosure, and the technical effects of the previous embodiments may be achieved, which will not be repeated here.
In an embodiment of the present disclosure, there is also provided a computer-readable storage medium in which program instructions are stored, which when loaded and executed by a processor, cause the processor to perform the optimization method or the data processing method of the computation graph described in the embodiment of the present disclosure.
In an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program or instructions which, when executed by a processor, implements an optimization method or a data processing method according to the computational graph described in the embodiment of the present disclosure.
Fig. 9 is a block diagram illustrating a combination processing apparatus 900 according to an embodiment of the disclosure. As shown, the combined processing device 900 includes a computing device 902, an interface device 904, other processing devices 906, and a storage device 908. Depending on the context of the application, one or more computing devices 910 may be included in the computing processing device, which may be configured as computing device 800 shown in FIG. 8 for performing the operations described herein in connection with the figures.
In various embodiments, the computing processing means of the present disclosure may be configured to perform user-specified operations. In an exemplary application, the computing processing device may be implemented as a single-core artificial intelligence processor or as a multi-core artificial intelligence processor. Similarly, one or more computing devices included within a computing processing device may be implemented as an artificial intelligence processor core or as part of a hardware architecture of an artificial intelligence processor core. When multiple computing devices are implemented as artificial intelligence processor cores or portions of hardware structures of artificial intelligence processor cores, the computing processing devices of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively accomplish user-specified operations. Depending on the implementation, other processing devices of the present disclosure may include one or more types of processors among general-purpose and/or special-purpose processors such as central processing units (Central Processing Unit, CPU), graphics processors (Graphics Processing Unit, GPU), artificial intelligence processors, and the like. These processors may include, but are not limited to, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), field programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing processing device of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure only with respect to it. However, when computing processing devices and other processing devices are considered together, both may be considered to form heterogeneous multi-core structures.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as a computing device for artificial intelligence, e.g., neural network operations) and external data and control, performing basic control including, but not limited to, data movement and starting and/or stopping the computing device. In other embodiments, the other processing devices may also cooperate with the computing processing device to jointly complete computational tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing device may obtain input data from other processing devices via the interface device, and write the input data to a storage device (or memory) on the computing device. Further, the computing processing device may obtain control instructions from other processing devices via the interface device, and write the control instructions into a control cache on the computing processing device chip. Alternatively or in addition, the interface device may also read data in a memory device of the computing processing device and transmit it to the other processing device.
Additionally or alternatively, the combined processing apparatus of the present disclosure may further comprise a storage device. As shown in the figure, the storage device is connected to the computing processing device and the other processing device, respectively. In one or more embodiments, the storage device may be used to store data of the computing processing device and/or the other processing device, for example data that cannot be stored entirely within the internal or on-chip storage of the computing processing device or the other processing device.
In some embodiments, the present disclosure also discloses a chip (e.g., chip 1002 shown in fig. 10). In one implementation, the chip is a System on Chip (SoC) integrated with one or more combined processing devices as shown in fig. 9. The chip may be connected to other related components through an external interface device (such as external interface device 1006 shown in fig. 10). The related components may be, for example, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. In some application scenarios, other processing units (e.g., video codecs) and/or interface modules (e.g., a DRAM interface) may also be integrated on the chip. In some embodiments, the disclosure also discloses a chip packaging structure including the chip. In some embodiments, the disclosure further discloses a board card including the chip packaging structure described above. The board card will be described in detail with reference to fig. 10.
Fig. 10 is a schematic diagram illustrating the structure of a board card 1000 according to an embodiment of the disclosure. As shown, the board card includes a storage device 1004 for storing data, which includes one or more storage units 1010. The storage device may be connected to the control device 1008 and the chip 1002 described above via, for example, a bus, for data transfer. Further, the board card also includes an external interface device 1006 configured for data relay or transfer between the chip (or a chip in a chip packaging structure) and an external device 1012 (e.g., a server or a computer). For example, data to be processed may be transferred from the external device to the chip through the external interface device. For another example, the calculation result of the chip may be transmitted back to the external device via the external interface device. The external interface device may take different forms depending on the application scenario; for example, it may use a standard PCIE interface.
In one or more embodiments, the control device in the disclosed board card may be configured to regulate the state of the chip. For this purpose, in an application scenario, the control device may include a microcontroller unit (Micro Controller Unit, MCU) for controlling the working state of the chip.
From the above description in connection with fig. 9 and 10, those skilled in the art will appreciate that the present disclosure also discloses an electronic device or apparatus that may include one or more of the above-described board cards, one or more of the above-described chips, and/or one or more of the above-described combined processing devices.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an aircraft, a ship, and/or a car; the household appliances include a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and a range hood; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a low-power electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device is compatible with that of the terminal device and/or the edge device, so that, based on the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate those of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of device-cloud or edge-cloud integration.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of the actions described. Accordingly, based on the disclosure or teachings herein, those of ordinary skill in the art will appreciate that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of one or more aspects of this disclosure. In addition, depending on the scenario, the descriptions of different embodiments of the present disclosure have their own emphases. In view of this, those skilled in the art may refer to the descriptions of other embodiments for the portions of a given embodiment that are not described in detail.
In particular implementations, based on the disclosure and teachings of the present disclosure, those of ordinary skill in the art will appreciate that several embodiments disclosed herein may also be implemented in ways not described herein. For example, for the foregoing embodiments of the electronic device or apparatus, the units are divided according to logical function, and there may be other ways of dividing them in actual implementation. For another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As regards the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected, as needed, to achieve the objectives of the embodiments of the disclosure. Moreover, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of such circuits may include, but is not limited to, physical devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as resistive random access memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, or RAM.
The foregoing may be better understood in light of the following clauses:
Clause 1, a method for optimizing a computational graph, comprising:
obtaining a view class operator subgraph of tensor data in the computational graph, wherein the view class operator subgraph comprises a view class source operator associated with the tensor data;
replacing each source operator in the view class operator subgraph, according to its function, with a mutually substitutable target operator of a specified function; and
fusing a plurality of consecutive identical target operators into a single target operator to generate a fused view class operator subgraph.
Clause 2, the method of clause 1, wherein fusing a plurality of consecutive identical target operators into a single target operator comprises:
combining the dimension operations of the plurality of target operators, so that the single target operator after combination is equivalent to the plurality of target operators before combination.
Clause 3, the method of any of clauses 1-2, further comprising:
after the fusion is performed, adjusting the position of a specific type of target operator so that it is processed later, wherein the specific type of target operator is an expand operator that causes an increase in memory data.
Clause 4, the method of clause 3, wherein adjusting the position of the specific type of target operator so that it is processed later comprises:
modifying, according to the positions of the specific type of target operator before and after the adjustment, the parameters of the target operators located between these two positions in the view class operator subgraph, so as to adapt to the adjustment.
Clause 5, the method of any of clauses 1-4, wherein the functions of the source operators are classified, according to their influence on the scale of the memory data, into three types: scale reduction, scale expansion, and scale invariance.
Clause 6, the method of clause 5, wherein the target operators corresponding to the three functions of scale reduction, scale expansion, and scale invariance are, respectively: a slice operator, an expand operator, and a permute operator.
Clause 7, the method of any of clauses 1-6, further comprising:
calling a corresponding kernel to perform data movement processing according to the information of the fused view class operator subgraph, so as to convert the tensor data into tensor data that is contiguous in memory.
Clause 8, a computing device for optimizing a computational graph, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions that, when loaded and executed by the processor, cause the processor to perform the optimization method of the computational graph according to any of clauses 1-7.
Clause 9, a computer readable storage medium, having stored therein program instructions, which when loaded and executed by a processor, cause the processor to perform the optimization method of the computational graph according to any of clauses 1-7.
Clause 10, a computer program product comprising a computer program or instructions which, when executed by a processor, implement the optimization method of the computational graph of any of clauses 1-7.
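To make clauses 1-7 more concrete, the following is a minimal Python sketch of the optimization flow under simplifying assumptions. The representation of the view class operator subgraph as a list of (operator, parameters) pairs, the source-operator names in SOURCE_TO_TARGET, and the helper functions are illustrative only; the slice/expand/permute naming follows clause 6, while the fusion of slice and expand runs and the parameter adaptation of clause 4 are deliberately simplified or elided.

```python
# Illustrative sketch of clauses 1-7. The subgraph representation (a list of
# (operator, parameters) tuples), the source-operator names, and the helper
# functions are assumptions made for illustration; they are not the actual
# data structures or operator set of the present disclosure.

# Clauses 5-6: source view class operators map, by function, onto three target
# operators: slice (scale reduction), expand (scale expansion), permute
# (scale invariance).
SOURCE_TO_TARGET = {
    "narrow": "slice", "index_select": "slice",         # reduce the memory-data scale
    "broadcast_to": "expand", "repeat_view": "expand",  # enlarge the memory-data scale
    "transpose": "permute", "movedim": "permute",       # keep the memory-data scale
}

def replace_with_targets(subgraph):
    """Second step of clause 1: rewrite each source operator as its target operator."""
    return [(SOURCE_TO_TARGET.get(op, op), params) for op, params in subgraph]

def fuse_consecutive(subgraph):
    """Clauses 1-2: merge runs of identical target operators into a single operator
    whose dimension operation is the composition of the run (shown here for permute)."""
    fused = []
    for op, params in subgraph:
        if fused and op == "permute" and fused[-1][0] == "permute":
            prev = fused.pop()[1]
            params = [prev[i] for i in params]  # compose the two permutations
        fused.append((op, params))
    return fused

def postpone_expands(subgraph):
    """Clause 3 (simplified): move expand operators to the end so that the
    data-enlarging work happens as late as possible. The parameter adaptation
    of the operators that are skipped over (clause 4) is elided here."""
    others = [node for node in subgraph if node[0] != "expand"]
    expands = [node for node in subgraph if node[0] == "expand"]
    return others + expands

def optimize_view_subgraph(subgraph):
    return postpone_expands(fuse_consecutive(replace_with_targets(subgraph)))

# Example: two consecutive transposes collapse into one permute, and the
# broadcast is rewritten as a single trailing expand.
sub = [("transpose", [1, 0, 2]), ("transpose", [0, 2, 1]), ("broadcast_to", [4, 3, 2])]
print(optimize_view_subgraph(sub))
# -> [('permute', [1, 2, 0]), ('expand', [4, 3, 2])]
```

In frameworks with comparable view semantics, the final step of clause 7 would then correspond to invoking a single copy kernel over the fused description (conceptually, one permute followed by one expand followed by a contiguous materialization), rather than one memory pass per original source operator; this is the reduction in device-side memory handling and operator calls that the disclosure targets.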
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (10)

1. A method of optimizing a computational graph, comprising:
obtaining a view class operator subgraph of tensor data in the computational graph, wherein the view class operator subgraph comprises a view class source operator associated with the tensor data;
replacing each source operator in the view class operator subgraph, according to its function, with a mutually substitutable target operator of a specified function; and
fusing a plurality of consecutive identical target operators into a single target operator to generate a fused view class operator subgraph.
2. The method of claim 1, wherein fusing a plurality of consecutive identical target operators into a single target operator comprises:
combining the dimension operations of the plurality of target operators, so that the single target operator after combination is equivalent to the plurality of target operators before combination.
3. The method of any of claims 1-2, further comprising:
after the fusion is performed, adjusting the position of a specific type of target operator so that it is processed later, wherein the specific type of target operator is an expand operator that causes an increase in memory data.
4. The method according to claim 3, wherein adjusting the position of the specific type of target operator so that it is processed later comprises:
modifying, according to the positions of the specific type of target operator before and after the adjustment, the parameters of the target operators located between these two positions in the view class operator subgraph, so as to adapt to the adjustment.
5. The method of any of claims 1-4, wherein the functions of the source operators are classified, according to their influence on the scale of the memory data, into three types: scale reduction, scale expansion, and scale invariance.
6. The method of claim 5, wherein the target operators corresponding to the three functions of scale reduction, scale expansion, and scale invariance are, respectively: a slice operator, an expand operator, and a permute operator.
7. The method of any of claims 1-6, further comprising:
calling a corresponding kernel to perform data movement processing according to the information of the fused view class operator subgraph, so as to convert the tensor data into tensor data that is contiguous in memory.
8. A computing device for optimizing a computational graph, comprising:
a processor configured to execute program instructions; and
a memory configured to store the program instructions, which when loaded and executed by the processor, cause the processor to perform the optimization method of the computational graph according to any one of claims 1-7.
9. A computer readable storage medium having stored therein program instructions which, when loaded and executed by a processor, cause the processor to perform the optimization method of the computational graph according to any of claims 1-7.
10. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of optimizing a computational graph according to any one of claims 1 to 7.
CN202111433244.5A 2021-11-29 2021-11-29 Optimization method and device for calculation graph and related product Pending CN116185377A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111433244.5A CN116185377A (en) 2021-11-29 2021-11-29 Optimization method and device for calculation graph and related product
PCT/CN2022/132745 WO2023093623A1 (en) 2021-11-29 2022-11-18 Computation graph optimization method, data processing method and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111433244.5A CN116185377A (en) 2021-11-29 2021-11-29 Optimization method and device for calculation graph and related product

Publications (1)

Publication Number Publication Date
CN116185377A true CN116185377A (en) 2023-05-30

Family

ID=86438868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111433244.5A Pending CN116185377A (en) 2021-11-29 2021-11-29 Optimization method and device for calculation graph and related product

Country Status (1)

Country Link
CN (1) CN116185377A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117289948A (en) * 2023-11-24 2023-12-26 北京壁仞科技开发有限公司 Operator elimination method, device, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
CN109219821A (en) Arithmetic unit and method
JP2020518042A (en) Processing device and processing method
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
US20230367722A1 (en) Data processing device and method, and related products
CN112463159B (en) Compiling method, compiling device, electronic equipment and storage medium
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
CN112070202B (en) Fusion graph generation method and device and computer readable storage medium
CN112799599B (en) Data storage method, computing core, chip and electronic equipment
CN114580606A (en) Data processing method, data processing device, computer equipment and storage medium
WO2023071238A1 (en) Computational graph compiling and scheduling methods and related products
CN112463160A (en) Compiling method, compiling device, electronic equipment and storage medium
CN115221102B (en) Method for optimizing convolution operation of system-on-chip and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN116185377A (en) Optimization method and device for calculation graph and related product
WO2023030507A1 (en) Compilation optimization method and apparatus, computer device and storage medium
Sun et al. Efficient tensor cores support in tvm for low-latency deep learning
WO2022253075A1 (en) Compilation method and related apparatus
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
CN115840894A (en) Method for processing multidimensional tensor data and related product thereof
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
Peng et al. OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation
Hart et al. SelectionConv: Convolutional Neural Networks for Non-rectilinear Image Data
CN116185274A (en) Data processing method, computing device and related products
WO2022111013A1 (en) Device supporting multiple access modes, method and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination