WO2023141939A1 - Method and device for processing computing task - Google Patents
- Publication number
- WO2023141939A1 (PCT/CN2022/074576)
- Authority
- WIPO (PCT)
- Prior art keywords
- axis
- input
- operator
- tensors
- tensor
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Definitions
- the embodiments of the present application relate to the field of artificial intelligence, and more specifically, to a method and device for processing computing tasks.
- Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. Artificial intelligence enables machines to have the functions of perception, reasoning and decision-making by studying the design principles and implementation methods of various intelligent machines. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
- Deep learning open source software frameworks such as tensorflow, pytorch, and mxnet provide users with a friendly programming environment for deep learning models, allowing users to easily deploy a designed deep learning model on general-purpose computer hardware platforms such as the central processing unit (CPU) and the graphics processing unit (GPU).
- the forward inference framework of the specific device vendor is generally used; for example, TensorRT is used on Nvidia GPUs.
- the deep learning compiler can be used to generate efficient code on different types of devices for a model described by a deep learning framework.
- Deep learning compilers usually improve the running performance of models on different hardware through graph optimization and operator optimization. These two optimizations are usually relatively decoupled and independent of each other.
- the implementation of graph optimization often needs to be based on the principles of the operators themselves in order to obtain a suitable parallel strategy for operator optimization. Therefore, how the graph optimizer can perform automatic operator segmentation without relying on the principles of specific operators is an urgent problem to be solved.
- the embodiments of the present application provide a method and device for processing computing tasks, so that the graph optimizer can automatically split the input and output tensors of operators without knowing the principles of specific operators, thereby realizing the complete decoupling of the graph optimizer and the operator optimization module and enabling the operators corresponding to computing tasks to be computed in parallel on multiple computing resources.
- N splittable axes included in the first operator means that the input tensor of the first operator includes N splittable axes.
- the computing tasks may be computing tasks in the field of artificial intelligence, such as image processing, video processing, speech processing, or natural language processing tasks; they may also be computing tasks in the field of big data processing or in the field of high-performance computing (HPC), which is not limited in this application.
- the input tensor of the first operator corresponding to the computing task can be the input tensor corresponding to the computing task in any of the above fields.
- the computing task is an image processing task
- the input tensor of the first operator represents the image related data.
- the graph optimizer obtains the operator segmentation information from the operator segmentation information database. Since the segmentation information of each operator can be obtained directly from this database, the graph optimizer does not need to perceive the mathematical semantics or underlying implementation of any operator, and can automatically segment the input tensors of operators, realizing the complete decoupling of graph optimization and operator optimization so that the operators corresponding to the computing tasks are computed in parallel on multiple computing resources.
- the axis type of a splittable axis is one of the following: element axis, reduction axis, or sliding window axis. An axis whose elements have a point-to-point mapping between the operator's input tensor and output tensor is an element axis; if a first axis exists in the operator's input tensor but not in its output tensor, the first axis is a reduction axis; an axis on which the operator performs a sliding-window scan over the elements of its input tensor is a sliding window axis.
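The three axis types above can be illustrated with a minimal NumPy sketch (the operator choices — ReLU, sum, and a window-3 scan — are illustrative examples, not taken from the patent):

```python
import numpy as np

# Element axis: every element of the input maps point-to-point to the output,
# so the axis appears unchanged in both shapes (e.g. ReLU).
x = np.arange(12, dtype=np.float32).reshape(3, 4)
relu_out = np.maximum(x, 0)           # shape (3, 4): both axes are element axes

# Reduction axis: the axis exists in the input but not in the output (e.g. sum).
sum_out = x.sum(axis=1)               # shape (3,): axis 1 is a reduction axis

# Sliding window axis: the operator scans the axis with a window (here size 3),
# so its output length is input_length - window + 1.
window = 3
row = x[0]                            # shape (4,)
conv_out = np.array([row[i:i + window].sum()
                     for i in range(len(row) - window + 1)])   # shape (2,)
```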
- the target segmentation axis is determined, where the target segmentation axis is one of the N splittable axes; according to the segmentation information of the first operator, the segmentation method corresponding to the axis type of the target segmentation axis in the first operator is determined; and according to that segmentation method, the input tensor of the first operator is segmented to obtain K groups of input tensors.
- by applying, to the different axis types in an operator's input tensor, the segmentation method corresponding to each axis type, the graph optimizer can automatically obtain different single-operator segmentation strategies without relying on the principles of specific operators, thereby realizing the complete decoupling of the graph optimizer and the operator optimization module.
- segmenting the input tensor of the first operator according to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator to obtain K groups of input tensors includes: according to the segmentation method, determining the Q first input tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; according to the axis type of the target segmentation axis in the first operator and the number K of target computing resources, segmenting each of the Q first input tensors to obtain Q groups of second input tensors; and obtaining the K groups of input tensors based on the Q groups of second input tensors and the unsegmented input tensors of the first operator.
- each group of second input tensors in the Q groups of second input tensors includes K second input tensors
- the qth group of second input tensors is the result of segmenting the qth first input tensor of the Q first input tensors into K pieces
- the kth group of input tensors in the K groups of input tensors includes the kth second input tensor of each of the Q groups of second input tensors together with the unsegmented input tensors of the first operator.
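The assembly of the K input groups can be sketched in NumPy as follows (a hedged illustration: the helper name `build_k_groups` and the choice of axis 0 as the split axis are assumptions, not from the patent):

```python
import numpy as np

def build_k_groups(split_inputs, unsplit_inputs, k):
    """Split each of the Q tensors in `split_inputs` into k pieces along axis 0,
    then assemble k input groups: the k-th group takes the k-th piece from every
    split tensor plus all unsplit tensors unchanged."""
    # Q groups of "second input tensors", each containing k pieces.
    q_groups = [np.array_split(t, k, axis=0) for t in split_inputs]
    # K groups of input tensors, one per target computing resource.
    return [[g[i] for g in q_groups] + list(unsplit_inputs) for i in range(k)]

a = np.arange(8).reshape(4, 2)    # splittable input (Q = 1)
b = np.ones((2, 2))               # unsplit input (e.g. shared weights)
groups = build_k_groups([a], [b], k=2)
```

Each of the two groups holds one half of `a` plus the whole of `b`, so the two target resources can run the operator independently.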
- the second operator includes P splittable axes, and the P splittable axes are a subset of the N splittable axes.
- segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors includes: obtaining the segmentation information of the second operator from the operator segmentation information library.
- the segmentation methods included in the pth group of candidate segmentation methods are determined according to the pth piece of segmentation reference information among the P pieces of segmentation reference information and the number M of computing resources.
- the graph optimizer automatically splits the operator input and output tensors according to different types of axes.
- it is not necessary to split the input and output tensors based on the principles of specific operators; it is only necessary to split them based on the operator segmentation methods corresponding to the different axis types.
- the calculation formula of the operator is not changed by splitting its input and output tensors; only some parameters of the operator change. This realizes the complete decoupling of graph optimization from specific operator principles, and, because the segmentation is based on axis types, the segmentation method for the operator's input tensors generalizes more strongly.
- the position information of the segmentation axis on the input and output tensors of the operator allows an appropriate operator segmentation method to be selected flexibly.
- segmenting the input tensor of the first operator to obtain K groups of input tensors includes: according to the target segmentation method, determining the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; and, according to the axis type of the target segmentation axis in the first operator, its axis type in the second operator, and the number K of target computing resources, segmenting each of the Q first input tensors to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors.
- the kth group of input tensors in the K groups of input tensors includes the kth second input tensor of each of the Q groups of second input tensors together with the unsegmented input tensors of the first operator.
- segmenting each of the Q first input tensors to obtain Q groups of second input tensors includes: if the axis type of the target segmentation axis in the first operator is an element axis or a sliding window axis, and its axis type in the second operator is also an element axis or a sliding window axis, then according to the first position information and the second position information of the target segmentation axis, determining the L first output tensors that include the target segmentation axis, and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using the first input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain the third input length; using the third input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain the first output length; segmenting the L first output tensors along the target segmentation axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where the lengths corresponding to the target segmentation axis in the kth second output tensor of each group of second output tensors are equal; and using the K third input lengths corresponding to the target segmentation axis as the input of the reverse derivation function corresponding to the axis type in the first operator to obtain the K second input lengths corresponding to the target segmentation axis in each group of second input tensors in the Q groups of second input tensors.
- continuous operator operations are performed on the divided input tensor on the same target computing resource, so that parallel computing of multiple target computing resources can be realized.
- the first position information of the target segmentation axis is also used to indicate the position of the target segmentation axis in the output tensor of the first operator.
- segmenting each of the Q first input tensors to obtain Q groups of second input tensors includes: according to the first position information of the target segmentation axis, determining the L first output tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; and using the first input length as the input of the forward shape derivation function of the target segmentation axis to obtain the first output length.
- the lth group of second output tensors in the L groups of second output tensors is the result of segmenting the lth first output tensor of the L first output tensors into K pieces.
- the lengths corresponding to the target segmentation axis in the kth second output tensor of each group of second output tensors in the L groups are equal, and the lengths corresponding to the target segmentation axis in the kth second input tensor of each group of second input tensors in the Q groups are equal.
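The forward and reverse shape derivations described above can be sketched per axis type as follows (a hedged illustration: the function names `forward_len`, `reverse_len`, and `split_lengths`, and the "valid" windowing convention, are assumptions, not the patent's definitions):

```python
def forward_len(axis_type, in_len, window=1):
    """Forward shape derivation: length on the split axis of the input
    -> length on the corresponding axis of the output."""
    if axis_type == "element":
        return in_len                     # point-to-point mapping
    if axis_type == "sliding_window":
        return in_len - window + 1        # valid sliding-window scan
    if axis_type == "reduction":
        return 1                          # axis disappears / collapses to 1
    raise ValueError(axis_type)

def reverse_len(axis_type, out_len, window=1):
    """Reverse derivation: required output length -> needed input length."""
    if axis_type == "element":
        return out_len
    if axis_type == "sliding_window":
        return out_len + window - 1       # needs window-1 extra ("halo") elements
    raise ValueError(axis_type)

def split_lengths(total, k):
    """Split an output length into k near-equal pieces (like np.array_split)."""
    base, rem = divmod(total, k)
    return [base + (1 if i < rem else 0) for i in range(k)]

# Example: input of length 10 on a sliding-window axis, window 3, K = 2.
out_len = forward_len("sliding_window", 10, window=3)
out_pieces = split_lengths(out_len, 2)
in_pieces = [reverse_len("sliding_window", p, window=3) for p in out_pieces]
```

Running the example: the forward derivation maps input length 10 to output length 8, the output is split into two pieces of 4, and the reverse derivation tells each resource it needs an input slice of length 6.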
- if the axis type of the target segmentation axis in the first operator is an element axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors includes: scheduling the first segmentation function to segment each of the Q first input tensors along the target segmentation axis, obtaining the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the qth group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the qth first input tensor; the elements corresponding to the target segmentation axis in the different second input tensors of the qth group have no intersection; and their union is exactly the set of elements corresponding to the target segmentation axis in the qth first input tensor.
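The disjoint-subset/union property for an element axis can be checked directly (a minimal NumPy sketch, not from the patent):

```python
import numpy as np

# Split along an element axis: the slices are disjoint, and concatenating
# them along the split axis reconstructs the original tensor exactly.
x = np.arange(10)
k = 3
pieces = np.array_split(x, k)      # lengths [4, 3, 3], no overlap
restored = np.concatenate(pieces)  # union of the slices == original tensor
```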
- if the axis type of the target segmentation axis in the first operator is a sliding window axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors includes: scheduling the first slicing function to slice each of the Q first input tensors along the target segmentation axis with overlap, obtaining the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the qth group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the qth first input tensor; the elements corresponding to the target segmentation axis in adjacent second input tensors of the qth group intersect (the slices overlap); and their union is exactly the set of elements corresponding to the target segmentation axis in the qth first input tensor.
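An overlapping split of a sliding-window axis can be sketched as follows (a hedged illustration: the helper name `overlap_split` and the window-3 example are assumptions; each piece carries the window-1 extra elements it needs so the pieces can be processed independently):

```python
import numpy as np

def overlap_split(x, k, window):
    """Split 1-D x into k pieces along a sliding-window axis, each piece
    extended with the window-1 'halo' elements it needs so every piece can
    be scanned independently of the others."""
    out_len = len(x) - window + 1          # total sliding-window outputs
    base, rem = divmod(out_len, k)
    pieces, start = [], 0
    for i in range(k):
        n_out = base + (1 if i < rem else 0)
        pieces.append(x[start:start + n_out + window - 1])  # overlap: window-1
        start += n_out
    return pieces

x = np.arange(10)
parts = overlap_split(x, k=2, window=3)    # two pieces of length 6, overlapping by 2
```

Adjacent pieces share `window - 1` elements, which is exactly the intersection described above; the union of the pieces still covers the whole axis.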
- if the axis type of the target segmentation axis in the first operator is a sliding window axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors may also include: scheduling the second segmentation function to segment each of the Q first input tensors along the target segmentation axis without overlap, obtaining Q groups of third input tensors, where each group of third input tensors includes K third input tensors; and then, according to the K second input lengths, scheduling the second slicing function to process the K third input tensors and obtain the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the qth group of third input tensors are a subset of the elements corresponding to the target segmentation axis in the qth first input tensor; the elements corresponding to the target segmentation axis in the different third input tensors of the qth group have no intersection; and their union is exactly the set of elements corresponding to the target segmentation axis in the qth first input tensor.
- the elements corresponding to the target segmentation axis of the kth second input tensor in the qth group of second input tensors in the Q group of second input tensors are continuous.
- the non-overlapping sliding window axis is suitable for scenarios that require frequent data synchronization between different computing resources, for example multi-die parallelism, where the splicing function serves as the link between different dies. In this way no overlapping data is computed repeatedly and the overlap does not grow continuously, which effectively reduces both the computing pressure and the storage pressure on the computing resources.
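The non-overlapping variant can be sketched as a disjoint split plus a boundary "splice" from the neighbouring resource before computing (a hedged illustration: the helper name `disjoint_split_with_splice` and the window-3 sum are assumptions; the splice here models an inter-die transfer rather than duplicated storage):

```python
import numpy as np

def disjoint_split_with_splice(x, k, window):
    """Non-overlapping split of a sliding-window axis: each resource stores a
    disjoint slice and, before computing, splices in the window-1 boundary
    elements it needs from its right-hand neighbour."""
    slices = np.array_split(x, k)                 # disjoint storage per resource
    spliced = []
    for i, s in enumerate(slices):
        if i + 1 < k:                             # fetch the halo from the neighbour
            halo = slices[i + 1][:window - 1]
            s = np.concatenate([s, halo])
        spliced.append(s)
    return slices, spliced

x = np.arange(10)
window = 3
stored, ready = disjoint_split_with_splice(x, k=2, window=window)

# Sliding-window sums computed piecewise match the un-split computation.
full = [int(x[i:i + window].sum()) for i in range(len(x) - window + 1)]
piecewise = [int(p[i:i + window].sum()) for p in ready
             for i in range(len(p) - window + 1)]
```

Storage stays disjoint (no duplicated overlap data); only the two boundary elements cross between resources per step.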
- if the axis type of the target segmentation axis in the first operator is a reduction axis, segmenting each of the Q first input tensors to obtain Q groups of second input tensors includes: according to the number K of target computing resources, calling the third segmentation function to segment each of the Q first input tensors, obtaining the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the qth group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the qth first input tensor; the elements corresponding to the target segmentation axis in the different second input tensors of the qth group have no intersection; and their union is exactly the set of elements corresponding to the target segmentation axis in the qth first input tensor.
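Splitting a reduce-sum axis can be illustrated as K disjoint partial reductions followed by a cross-resource combine (a minimal NumPy sketch; the reduce-sum choice is one of the first-type reduction axes listed later, and the final `sum` stands in for the cross-resource reduction step):

```python
import numpy as np

# Each of K resources reduces its disjoint slice of the reduction axis,
# then a final cross-resource reduction combines the K partial results.
x = np.arange(12, dtype=np.float64)
k = 3
partials = [p.sum() for p in np.array_split(x, k)]   # one partial sum per resource
combined = sum(partials)                             # cross-resource reduction
```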
- since the type of the reduction axis already determines the specific segmentation method, the graph optimizer, when performing graph optimization, does not need to rely on the principle of a specific operator and can reasonably segment the input tensors of operators that include a reduction axis.
- the characteristic of the reduction axis is that it either does not appear in the output tensor or its length on the output tensor is 1. Therefore, traditional operator segmentation methods cannot segment an axis of the input tensor that has this reduction-axis characteristic.
- the reduction axis includes a first type of reduction axis and a second type of reduction axis, where the first type is a reduction axis on which the operator performs a reduction operation over the elements of its input tensor, and the second type is a reduction axis on which the operator does not perform a reduction operation over the elements of its input tensor.
- the first type of reduction axis includes any of the following: the reduce-sum axis, the reduce-max axis, the reduce-min axis, and the reduce-mean axis, on which the operator respectively performs a sum, maximum, minimum, or average reduction over the elements of its input tensor.
- the second type of reduction axis includes the gather axis, on which the operator looks up elements of its input tensor according to the addresses indicated by the elements of its index input tensor.
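The gather axis as a second-type reduction axis can be illustrated as follows (a minimal NumPy sketch, not from the patent): the gathered axis disappears from the output shape, but the operator only looks elements up by index and never combines them arithmetically, so splitting it calls for per-slice index remapping rather than a partial-result combine.

```python
import numpy as np

# Gather: the index tensor supplies addresses into the table's first axis.
# That axis is "reduced" away from the output shape without any arithmetic.
table = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
idx = np.array([5, 0, 3])
gathered = table[idx]        # output shape follows idx, not table
```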
- the computing resource includes one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
- the processor is specifically configured to: determine the target segmentation axis, which is one of the N splittable axes; determine, according to the segmentation information of the first operator, the segmentation method corresponding to the axis type of the target segmentation axis in the first operator; and segment the input tensor of the first operator according to that segmentation method to obtain K groups of input tensors.
- the processor is specifically configured to: determine, according to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the Q first input tensors of the first operator that include the target segmentation axis.
- the processor is specifically configured to: obtain the segmentation information of the second operator from the operator segmentation information library, where the segmentation information of the second operator includes the axis type and the second position information of the pth splittable axis of the P splittable axes in the second operator, the second position information being used to indicate the position of the pth splittable axis in the input tensor of the second operator.
- the processor is specifically configured to: determine, according to the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, the Q first input tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; and segment, according to the axis type of the target segmentation axis in the first operator, its axis type in the second operator, and the number K of target computing resources, each of the Q first input tensors to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors.
- the k-th group of input tensors among the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors together with the input tensors of the first operator that are not split.
- the processor is specifically configured to: determine, according to the first position information and the second position information of the target segmentation axis, the L first output tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use the first input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain the third input length, where the first input length is the length of the target segmentation axis in each first input tensor and these lengths are equal across the first input tensors; use the third input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain the first output length; according to the first output length and the number K of target computing resources, split the L first output tensors along the target segmentation axis to obtain L groups of second output tensors, where each group includes K second output tensors and the l-th group is the result of cutting the l-th first output tensor into K pieces; use the K second output lengths corresponding to the target segmentation axis as the inputs of the reverse derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain the K third input lengths, where the length of the target segmentation axis corresponding to the k-th second output tensor is equal across the L groups of second output tensors, and the length corresponding to the k-th tensor in each group of fifth input tensors among the Q groups of fifth input tensors is equal; use the K third input lengths corresponding to the target segmentation axis in each group of fifth input tensors as the inputs of the reverse derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain the K second input lengths corresponding to the target segmentation axis in each group of second input tensors, where the length corresponding to the k-th second input tensor is equal across the Q groups of second input tensors; and, according to these K second input lengths, split each of the Q first input tensors along the target segmentation axis to obtain the Q groups of second input tensors.
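The forward and reverse shape derivation functions described above can be illustrated with a small Python sketch for a sliding-window axis (for example, a 1-D convolution with kernel size k and stride s). The function names and the particular formulas are illustrative assumptions, not taken from the patent text:

```python
# Hedged sketch of forward/reverse shape derivation for a sliding-window
# axis. forward_shape maps an input-axis length to an output-axis length;
# reverse_shape maps a desired output-axis length back to the input-axis
# length that produces it. Names and formulas are illustrative.

def forward_shape(input_len: int, kernel: int, stride: int) -> int:
    """Forward derivation: input-axis length -> output-axis length."""
    return (input_len - kernel) // stride + 1

def reverse_shape(output_len: int, kernel: int, stride: int) -> int:
    """Reverse derivation: output-axis length -> required input-axis length."""
    return (output_len - 1) * stride + kernel

# Split the output axis across K = 2 resources, then derive the input
# length each resource needs (the pieces overlap by kernel - stride).
K, kernel, stride = 2, 3, 1
input_len = 10
out_len = forward_shape(input_len, kernel, stride)       # total output length
per_resource_out = [out_len // K] * K                    # even output split
per_resource_in = [reverse_shape(o, kernel, stride) for o in per_resource_out]
```

Running the forward function over a derived per-resource input length recovers the per-resource output length, which is the consistency the derivation chain relies on.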
- continuous operator operations are performed on the split input tensors on the same target computing resource, so that parallel computing across multiple target computing resources can be realized.
- the processor is specifically configured to: determine, according to the first position information of the target segmentation axis, the L first output tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use the first input length as the input of the forward shape derivation function of the target segmentation axis to obtain the first output length, where the first input length is the length of the target segmentation axis in each first input tensor and these lengths are equal; the length of the target segmentation axis corresponding to the k-th second output tensor is equal across the L groups of second output tensors, and the length corresponding to the k-th second input tensor is equal across the Q groups of second input tensors; and, according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors, split each of the Q first input tensors along the target segmentation axis to obtain the Q groups of second input tensors.
- the processor is specifically configured to: split, by scheduling the first slicing function and according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors, each of the Q first input tensors along the target segmentation axis to obtain the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor; the elements corresponding to the target segmentation axis in the second input tensors of the q-th group are pairwise disjoint; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
- the processor is specifically configured to: split, by scheduling the first slicing function and according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors, each of the Q first input tensors along the target segmentation axis with overlap to obtain the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor; the elements corresponding to the target segmentation axis in the second input tensors of the q-th group have pairwise intersections; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
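The overlapping split just described can be sketched in a few lines of Python: a 1-D input is split with an overlap of k - 1 elements along a sliding-window axis, so each slice can be processed independently and the per-slice results concatenated. All names here are illustrative, not from the patent:

```python
# Hedged sketch: overlapping split along a sliding-window axis, so each
# piece can run a window operation independently. A sliding max (stride 1)
# stands in for an arbitrary sliding-window operator.

def windows(xs, k):
    """All length-k sliding windows of xs (stride 1)."""
    return [xs[i:i + k] for i in range(len(xs) - k + 1)]

def split_with_overlap(xs, parts, k):
    """Split xs into `parts` pieces that overlap by k - 1 elements."""
    n_out = len(xs) - k + 1              # total number of output windows
    per = n_out // parts                 # windows per piece (assume divisible)
    return [xs[p * per : p * per + per + k - 1] for p in range(parts)]

x = list(range(10))
k = 3
full = [max(w) for w in windows(x, k)]                 # unsplit reference
pieces = split_with_overlap(x, 2, k)
parallel = [m for p in pieces for m in (max(w) for w in windows(p, k))]
```

Because neighboring pieces share k - 1 boundary elements, the concatenated per-piece results match the unsplit result, at the cost of storing and recomputing the overlap.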
- the processor is specifically configured to: split, by scheduling the second slicing function, each of the Q first input tensors along the target segmentation axis to obtain Q groups of third input tensors, where each group of third input tensors includes K third input tensors; split, by scheduling the second slicing function and according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors, the K third input tensors in each of the Q groups along the target segmentation axis to obtain Q groups of fourth input tensors; and splice, by scheduling the splicing function, the k-th fourth input tensor of the q-th group of fourth input tensors with the corresponding third input tensor of the q-th group of third input tensors.
- the elements corresponding to the target segmentation axis in the q-th group of third input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor; the elements corresponding to the target segmentation axis in the third input tensors of the q-th group are pairwise disjoint; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
- the elements corresponding to the target segmentation axis of the k-th second input tensor in the q-th group of second input tensors are contiguous.
- the sliding-window axis split without overlap is suitable for scenarios that require frequent data synchronization between different computing resources, for example multi-die parallelism, using the splicing function as the link between different dies. In this way it neither causes repeated computation of overlapping data nor lets the overlapping data grow continuously, which can effectively reduce the computation and storage pressure on the computing resources.
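The non-overlapping scheme above (disjoint slices plus a splicing function that passes boundary data between resources) can be sketched as follows; the halo-splicing helper is a stand-in for the patent's splicing function, and all names are illustrative:

```python
# Hedged sketch of the non-overlapping sliding-window split: each resource
# keeps a disjoint slice, and before a window step the k - 1 boundary
# elements are spliced in from the right neighbor, so no overlap is stored
# or recomputed. A sliding max (stride 1) stands in for the operator.

def windows(xs, k):
    """All length-k sliding windows of xs (stride 1)."""
    return [xs[i:i + k] for i in range(len(xs) - k + 1)]

def run_with_halo_splice(slices, k):
    """Process disjoint slices, splicing neighbor halos instead of overlapping."""
    out = []
    for i, s in enumerate(slices):
        halo = slices[i + 1][:k - 1] if i + 1 < len(slices) else []
        out.extend(max(w) for w in windows(s + halo, k))
    return out

x = list(range(10))
k = 3
slices = [x[:5], x[5:]]                  # disjoint slices, no stored overlap
result = run_with_halo_splice(slices, k)
reference = [max(w) for w in windows(x, k)]
```

The spliced result matches the unsplit computation while each resource stores only its own disjoint slice, which is the storage advantage the text describes.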
- the processor is specifically configured to: call the third slicing function according to the number K of target computing resources, to split each of the Q first input tensors and obtain the Q groups of second input tensors.
- the elements corresponding to the target segmentation axis in the q-th group of second input tensors are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor; the elements corresponding to the target segmentation axis in the second input tensors of the q-th group are pairwise disjoint; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
- since the type of the reduction axis already determines the specific segmentation method, the graph optimizer does not need to rely on the principles of specific operators when performing graph optimization, and can reasonably segment the input tensors of operators that include a reduction axis.
- the characteristic of the reduction axis is that it either does not appear in the output tensor or has a length of 1 in the output tensor. Therefore, traditional operator splitting methods cannot split an axis of the input tensor that has this reduction-axis characteristic.
- the reduction axis includes a first type of reduction axis and a second type of reduction axis, where the first type of reduction axis is a reduction axis along which the operator performs a reduction operation on the elements of its input tensor, and the second type of reduction axis is a reduction axis along which the operator does not perform a reduction operation on the elements of its input tensor.
- the first type of reduction axis includes any of the following: the reduction-sum axis, the reduction-max axis, the reduction-min axis, and the reduction-mean axis; the reduction-sum axis is the axis along which the operator sums the elements of its input tensor; the reduction-max axis is the axis along which the operator takes the maximum of the elements of its input tensor; the reduction-min axis is the axis along which the operator takes the minimum of the elements of its input tensor; and the reduction-mean axis is the axis along which the operator averages the elements of its input tensor.
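Splitting along a first-type reduction axis works because each resource can reduce its own piece and the partial results can then be combined with the reduction matching the axis type. A minimal Python sketch (the variable names are illustrative):

```python
# Hedged sketch: splitting an input along a first-type reduction axis.
# Each piece is reduced locally; the partial results are then combined
# with the combiner that matches the axis type.

x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
pieces = [x[:3], x[3:]]                  # split along the reduction axis

total = sum(sum(p) for p in pieces)      # reduction-sum: sum of partial sums
mx = max(max(p) for p in pieces)         # reduction-max: max of partial maxima
mn = min(min(p) for p in pieces)         # reduction-min: min of partial minima
# reduction-mean: weight partial sums by piece length (a plain mean of
# partial means is only correct when all pieces have equal size)
mean = sum(sum(p) for p in pieces) / sum(len(p) for p in pieces)
```

Each combined value equals the unsplit reduction over x, which is why the axis type alone determines a valid segmentation method for these axes.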
- the second type of reduction axis includes the reduction-gather axis, along which the operator indexes data in its input tensor according to the addresses indicated by the elements of its index input tensor.
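A gather along such an axis reads elements of the data tensor at the addresses held in an index tensor. If the data tensor is split along the gathered axis, each piece's local indices must be offset back to the global addressing; the following sketch is illustrative (helper name and layout are assumptions, not from the patent):

```python
# Hedged sketch of a reduction-gather axis: idx holds addresses into data.
# When data is split along the gathered axis, each piece owns a contiguous
# range of global indices and is addressed via its offset.

def gather_split(pieces, offsets, idx):
    """Gather from split pieces by locating the piece owning each index."""
    out = []
    for i in idx:
        for p, off in zip(pieces, offsets):
            if off <= i < off + len(p):
                out.append(p[i - off])
                break
    return out

data = [10, 11, 12, 13, 14, 15]
idx = [5, 0, 3, 3]
gathered = [data[i] for i in idx]        # unsplit reference gather

pieces = [data[:3], data[3:]]            # split along the gathered axis
offsets = [0, 3]
split_result = gather_split(pieces, offsets, idx)
```

The split gather reproduces the unsplit gather, but unlike the first-type axes no arithmetic reduction is applied to combine pieces, which is why it is classified separately.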
- the target computing resource includes one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
- the device may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor performs the method in any implementation of the first aspect.
- a computer-readable medium stores program code, where the program code is used to execute the method in any implementation of the first aspect.
- FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of operator segmentation provided in the embodiment of the present application.
- FIG. 3 is a schematic flowchart of a method for processing computing tasks provided by an embodiment of the present application
- Fig. 4 is a schematic flow diagram of an operator segmentation method corresponding to a single operator completing a computing task provided by an embodiment of the present application;
- Fig. 5 is a schematic flowchart of another method for processing computing tasks provided by the embodiment of the present application.
- Fig. 6 is a schematic flowchart of an operator segmentation method corresponding to multi-operator completion of computing tasks provided by the embodiment of the present application;
- Fig. 7 is a schematic diagram of a splitting method of an element axis provided in the embodiment of the present application.
- Fig. 8 is a schematic diagram of a splitting method of a reduction-sum axis provided by an embodiment of the present application.
- Fig. 9 is a schematic diagram of a splitting method of the reduction-max axis provided by the embodiment of the present application.
- Fig. 10 is a schematic diagram of a splitting method of a reduction-mean axis provided by the embodiment of the present application.
- Fig. 11 is a schematic diagram of a segmentation method of a reduction-gather axis provided by the embodiment of the present application.
- Fig. 12 is a schematic diagram of a sliding window axis splitting method provided by the embodiment of the present application.
- Fig. 13 is a schematic diagram of another sliding window axis splitting method provided by the embodiment of the present application.
- Fig. 14 is a schematic diagram of the position information of an operator's splittable axis in the input tensor and output tensor of the operator provided by the embodiment of the present application;
- Fig. 15 is a schematic diagram of a specific application of operator segmentation provided by the embodiment of this application.
- Fig. 16 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of this application.
- Fig. 17 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of the present application.
- Fig. 18 is a schematic diagram of an operator tensor structure provided by an embodiment of the present application.
- Fig. 19 is a schematic diagram of an apparatus for processing computing tasks provided by an embodiment of the present application.
- references to "one embodiment" or "some embodiments" and the like in this specification mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
- appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments," unless specifically stated otherwise.
- the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
- a deep learning model refers to a machine learning model that includes a deep neural network structure. Algorithm engineers use a deep learning framework to build a model, adjust parameters, train and optimize it, and save the final network parameters together with the model structure; the resulting file is a model file that can be used for forward inference.
- model files trained by different deep learning frameworks are different, but a complete model file generally contains information such as tensor data, computing units, and calculation graphs.
- A tensor is the data container of a deep learning system, and can be understood as the extension of a matrix to an arbitrary number of dimensions.
- a tensor containing only one number is called a scalar (a scalar tensor, zero-dimensional tensor, or 0D tensor); an array of numbers is called a vector (a one-dimensional tensor or 1D tensor); an array of vectors is called a matrix (a two-dimensional tensor or 2D tensor); combining multiple matrices into a new piece of data yields a three-dimensional tensor, which can be pictured intuitively as a cube of numbers; combining multiple three-dimensional tensors into an array creates a four-dimensional tensor, and so on.
- Deep learning generally deals with tensors from 0D to 4D, but 5D tensors may be encountered when processing video data.
- the shape of a tensor indicates the number of elements in each dimension of the tensor. For example, [[[1,2,3]], [[7,8,9]]] is a three-dimensional tensor whose shape is (2,1,3): its size is 2 along the 0-axis, 1 along the 1-axis, and 3 along the 2-axis.
- FIG. 18 is a schematic diagram of a tensor provided by the embodiment of the present application.
- the shape of the tensor shown in FIG. 18 is (4, 20, 20, 3). Assuming that the tensor shown in FIG. 18 represents a feature map, the physical meaning of each dimension of this shape, from left to right, is as follows.
- the axis of a tensor is defined relative to its shape and indicates a subscript of the shape. For example, [[[1,2],[3,4]],[[5,6],[7,8]]] is a three-dimensional tensor with shape (2,2,2): the 0-axis indexes the first dimension, i.e. the two matrices [[1,2],[3,4]] and [[5,6],[7,8]]; the 1-axis indexes the second dimension, i.e. [1,2], [3,4], [5,6], and [7,8]; and the 2-axis indexes the third dimension, i.e. the numbers 1, 2, 3, 4, 5, 6, 7, and 8.
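The shape and axis notions above can be made concrete with plain nested Python lists (no framework assumed; the helper name is illustrative):

```python
# Small sketch of tensor shape and axes using regularly nested lists.

def shape(t):
    """Shape of a regularly nested list: one entry per axis."""
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

t3 = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
s = shape(t3)                 # the three axes are subscripts 0, 1, 2 of s
axis0_first = t3[0]           # indexing along axis 0 -> a matrix
axis1_first = t3[0][0]        # indexing along axis 1 -> a vector
axis2_first = t3[0][0][0]     # indexing along axis 2 -> a number
```

Indexing one level deeper moves one axis to the right, which is exactly the subscript-of-the-shape reading given in the text.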
- for the tensor shape (4,20,20,3) shown in Figure 18, the 0-axis carries the batch-size data of the feature map, the 1-axis the feature-map height data, the 2-axis the feature-map width data, and the 3-axis the feature-map channel data.
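As a sketch of how such a feature-map shape is split across computing resources, the following splits the batch axis (axis 0) of the (4, 20, 20, 3) shape across K = 2 resources; the variable names are illustrative and only shapes are modeled:

```python
# Hedged sketch: splitting the batch axis (axis 0) of a feature map with
# shape (N, H, W, C) = (4, 20, 20, 3) evenly across K resources. Only the
# shapes are tracked here, not the tensor data.

t_shape = (4, 20, 20, 3)     # N, H, W, C as described for Figure 18
K = 2
axis = 0
piece = t_shape[axis] // K   # batch elements per resource (assume divisible)
piece_shapes = [(piece,) + t_shape[1:] for _ in range(K)]
```

Each resource receives a tensor with the same height, width, and channel extents, and the batch sizes of the pieces sum back to the original batch size.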
- An operator, which can also be called an operation unit or a calculation unit, represents a symbolic operation process and is the basic unit of mainstream deep learning frameworks, that is, a node in the graph.
- the input and output of the operation unit are tensors. All transformations learned by deep networks can be reduced to a few tensor operations on tensors of numerical data.
- Common computing units include the add unit, the batch normalization (BatchNormalization) unit, the convolution unit, the gated recurrent unit (GRU), the local response normalization (LRN) unit, the long short-term memory (LSTM) unit, the max pooling (max pool) unit, the rectified linear unit (ReLU) activation function, the recurrent neural network (RNN) unit, and the Softmax function, among others.
- A computation graph, also known as a data flow graph, is defined as a directed acyclic graph (DAG).
- Both tensor and operation unit are objects in the graph, the operation unit is the node of the graph, and the tensor is the data flowing on the edge of the graph.
- Acyclic means that the graph cannot contain cycles; for example, a tensor x cannot be an input to one of the layers that generated x. The only processing loops that are allowed, i.e. recurrent connections, are those internal to recurrent layers.
- each node represents a neuron, and if the output of one node serves as the input of another node, the two nodes share an edge. That is, the nodes in this computation graph represent operators, and the edges between nodes represent data dependencies between two nodes.
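A minimal sketch of this structure (not the patent's data structures): nodes are operators, edges name the tensors flowing between them, and execution resolves the DAG's data dependencies:

```python
# Hedged sketch of a computation graph: each node maps to a function and
# the names of its input nodes; evaluation follows the data dependencies.

graph = {                      # node -> (function, input node names)
    "x":  (None, []),          # graph input
    "a":  (lambda v: v + 1, ["x"]),
    "b":  (lambda v: v * 2, ["x"]),
    "y":  (lambda u, v: u + v, ["a", "b"]),
}

def run(graph, feeds):
    """Return an evaluator that memoizes node values along the DAG."""
    done = dict(feeds)
    def value(name):
        if name not in done:
            fn, ins = graph[name]
            done[name] = fn(*[value(i) for i in ins])
        return done[name]
    return value

out = run(graph, {"x": 3})("y")    # (3 + 1) + (3 * 2)
```

Acyclicity is what lets this recursion terminate: a node's value never depends, directly or transitively, on itself.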
- Operator splitting is the splitting of the input tensor and output tensor of the operator.
- FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of the present application.
- the deep learning compiler will be briefly introduced below in conjunction with FIG. 1 .
- the deep learning compiler can be divided into the front end of the compiler, the middle end of the compiler, and the back end of the compiler.
- the front end of the compiler is connected to the application layer, that is, the front end of the compiler is connected to the deep learning model.
- the parser mainly converts the models trained under different frameworks into an internal format recognizable by the hardware, for example, converts the calculation graph of a framework such as tensorflow or caffe2 into a calculation graph of an internal recognizable format.
- the middle end of the compiler includes a graph optimizer and operator information, etc.
- the graph optimizer can also be called a graph optimization module.
- the middle end of the compiler assigns different computing tasks to different computing resources (for example, CPU, GPU) for subsequent model execution.
- the back end of the compiler is mainly responsible for automatically generating code instructions that match different hardware.
- the back end of the compiler includes Sub-compiler and operator library, etc.
- Deep learning compilers usually improve the performance of the model on different devices at two levels of graph optimization and operator optimization.
- Graph optimization and operator optimization are relatively decoupled and independent.
- Graph optimization is a general optimization strategy, that is, an optimization strategy that has nothing to do with specific operator types, while operator optimization is an operator-specific optimization strategy, that is, an optimization strategy that is related to specific operator types.
- Typical types of operator optimization strategies include compute optimization and schedule optimization.
- a single specific operator can be optimized to the extreme on a specific hardware platform.
- GEMM (general matrix-matrix multiplication) operators are usually optimized through manual scheduling techniques such as blocking, vectorization, loop permutation, data rearrangement (packing), and multi-core multi-thread parallelism, so that GEMM operators can obtain performance gains of dozens of times on the CPU.
- Typical graph optimization strategies include constant folding. Constant folding means that if all the input tensors an operator depends on are constants, the operator nodes unrelated to the model's runtime inputs can be computed in advance at compile time, saving overhead at model run time.
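A hedged sketch of constant folding on a toy graph representation (the node encoding and pass below are illustrative assumptions, not any particular compiler's IR): any node whose inputs are all constants is evaluated at "compile time" and replaced by a constant node.

```python
# Hedged sketch of a constant-folding pass. Nodes are (kind, value, inputs);
# a node whose inputs are all constants is evaluated and replaced.
import operator

graph = {
    "c1": ("const", 2, []),
    "c2": ("const", 5, []),
    "s":  ("add", None, ["c1", "c2"]),     # all inputs constant -> foldable
    "x":  ("input", None, []),
    "y":  ("mul", None, ["s", "x"]),       # depends on a runtime input
}
ops = {"add": operator.add, "mul": operator.mul}

def fold_constants(graph):
    folded = dict(graph)
    for name, (kind, _val, ins) in graph.items():
        if kind in ops and all(folded[i][0] == "const" for i in ins):
            args = [folded[i][1] for i in ins]
            folded[name] = ("const", ops[kind](*args), [])
    return folded

g2 = fold_constants(graph)
```

After the pass, "s" has become a constant while "y" remains an operator, since it still depends on the runtime input "x" — exactly the distinction the text draws.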
- graph optimization strategies such as graph segmentation with execution-order optimization, multi-die parallelism, multi-thread parallelism, and chip parallelism need to be based on the principles of the operators themselves; if they are not, these parallel optimization strategies cannot be expressed in the calculation graph.
- graph segmentation and execution-order optimization is a graph optimization strategy for reducing the memory limits on operator execution. Specifically, the outer-loop iteration variables of the operator are split uniformly, and the subsequent execution order of the operator is adjusted accordingly, so that the operator can perform a large number of iterative operations locally. This reduces the operator's memory requirements and allows more of the intermediate results generated by local operations to be kept in the L2 cache, thereby easing the memory limits on the operator's subsequent operations and ultimately improving the operational performance of the overall network model. The segmentation method of the operator is therefore particularly important.
- the die is a chip before packaging.
- the chip uses advanced packaging technology to aggregate computing power.
- an operator can be split across multiple dies, and data interaction between different dies should be reduced. The segmentation method of the operator is therefore particularly important.
- a subgraph is used as a basic operation unit, that is, a subgraph including different operators is used as a basic operation unit.
- when a subgraph is assigned to multiple threads for parallelism, the operators in it must be segmented; that is, when a subgraph is divided across different computing resources for parallel execution, for example across different CPUs, the operators in the subgraph need to be split. Since data synchronization between multiple threads implies that the run of a subgraph has ended, interaction between threads should also be minimized when multiple threads run in parallel. The segmentation of operators within subgraphs is therefore also particularly important.
- Figure 2 is a schematic diagram of operator segmentation provided by the embodiment of the present application.
- the tensor of the same operator can be divided into different slices, and different slices can run on different threads, different dies, or different chips.
- the input tensor of operator 1 is divided into slice 1 of operator 1 and slice 2 of operator 1
- the input tensor of operator 2 is divided into slice 1 of operator 2 and slice 2 of operator 2.
- slice 1 of operator 1 runs on computing resource 1, and its result is passed to computing resource 2, which runs slice 1 of operator 2; at the same time, slice 2 of operator 1 runs on computing resource 3, and its result is passed to computing resource 4, which runs slice 2 of operator 2; finally, the results on computing resource 2 and computing resource 4 are spliced together.
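The flow in Figure 2 can be sketched for two chained element-wise operators: the input is split into two slices, each slice flows through operator 1 and then operator 2 on its own "computing resource," and the final results are spliced. The stand-in operators below are illustrative assumptions:

```python
# Hedged sketch of Figure 2's pipeline: two element-wise operators applied
# to two slices independently, then the per-slice results are spliced.

op1 = lambda xs: [v + 1 for v in xs]      # stand-in for operator 1
op2 = lambda xs: [v * v for v in xs]      # stand-in for operator 2

x = [1, 2, 3, 4]
slice1, slice2 = x[:2], x[2:]             # slice 1 / slice 2 of operator 1

r1 = op2(op1(slice1))                     # resource 1 -> resource 2
r2 = op2(op1(slice2))                     # resource 3 -> resource 4
spliced = r1 + r2                         # final splice of both results
reference = op2(op1(x))                   # unsplit execution
```

For element-wise operators the splice reproduces the unsplit result exactly; for operators with cross-element dependencies (e.g. sliding windows or reductions), the axis-type-specific schemes discussed elsewhere in this document are needed.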
- the current approach is for algorithm engineers to manually classify the loop variables in the required operators and summarize the changes that occur after splitting each type of loop-variable axis, which enables efficient generation of auxiliary graph optimization strategies. At present, this efficient generation of graph optimization strategies is inseparable from the manual classification of loop variables in the required operators, so arbitrary operators cannot be split and run automatically.
- the input tensor of the operator is split at the application layer based on a splittable axis of the operator output (for example, the sample axis, parameter axis, or attribute axis), achieving the effect of multi-GPU parallel operation through operator splitting.
- the sample axis, parameter axis, and attribute axis are three types of axes that can be split on the output of the operator.
- splitting the operator input tensor along the sample axis divides it in the sample dimension, and the split pieces are allocated to different computing resources for data parallelism; splitting along the parameter axis divides the operator input tensor in the parameter dimension, and the split pieces are distributed to different computing resources for model parallelism; the attribute axes are the axes of the operator output other than the sample axis and the parameter axis, and splitting along an attribute axis divides the operator input tensor in the attribute dimension.
- the operator input tensor can thus be divided across different computing resources for calculation, either along one of the three axes alone or along a mixed combination of the three axes, achieving the effect of parallel operation on multiple computing resources.
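The sample-axis versus parameter-axis distinction can be sketched for a matrix multiply Y = X @ W: splitting X along the sample (batch) axis gives data parallelism, while splitting W along an output-parameter axis gives model parallelism. The pure-Python matmul below is illustrative, with no framework assumed:

```python
# Hedged sketch: sample-axis (data-parallel) vs parameter-axis
# (model-parallel) splitting of Y = X @ W.

def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1, 2], [3, 4], [5, 6], [7, 8]]      # 4 samples, 2 features
W = [[1, 0, 1], [0, 1, 1]]                # 2 x 3 parameter matrix
Y = matmul(X, W)                          # unsplit reference

# sample-axis split: each resource gets half the rows of X, all of W;
# results are concatenated along the sample axis
data_par = matmul(X[:2], W) + matmul(X[2:], W)

# parameter-axis split: each resource gets all of X, some columns of W;
# results are concatenated along the parameter axis
Wl = [row[:2] for row in W]
Wr = [row[2:] for row in W]
model_par = [a + b for a, b in zip(matmul(X, Wl), matmul(X, Wr))]
```

Both splittings reconstruct the unsplit product, differing only in which axis is distributed and therefore in which tensor each resource must hold in full.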
- although the current segmentation method can achieve a certain degree of automatic operator splitting at the application layer, it still has limitations.
- three axis dimensions are defined according to the axes of the output tensor to perform operator segmentation, which cannot cover all the splittable axes and splitting methods of an operator.
- the definition of these three axes is determined by the axis types of the operator's output tensor; that is, if an axis is absent from the operator's output tensor, segmentation is not performed based on the actual situation of the operator's input tensor. This results in a coarse segmentation of the operator input tensor, which cannot be split accurately and then allocated to different computing resources for calculation. Finally, this method still defines the segmentation axes and segmentation methods at the application layer: the algorithm engineer uses a scripting language to determine the segmentation method according to the segmentation axes contained in a given operator type. As a result, automatic segmentation of the inputs and outputs of arbitrary operators is still impossible, and complete decoupling of graph optimization from operator optimization cannot be achieved.
- the embodiment of the present application proposes a method and an apparatus for processing computing tasks, which will be described in detail below with reference to FIGS. 3 to 19 .
- Fig. 3 is a schematic flowchart of a method for processing computing tasks provided by an embodiment of the present application.
- S301: Determine a first operator for performing a computing task, where the first operator includes N splittable axes, and N is a positive integer greater than or equal to 1.
- N splittable axes included in the first operator means that the input tensor of the first operator includes N splittable axes.
- the segmentation information of the first operator can indicate the axis type of each of the N splittable axes in the first operator; the different axis types and the segmentation methods corresponding to them are described in detail later with reference to Figures 7 to 17.
- the splitting information of the first operator may also indicate in which input tensors of the first operator each splittable axis appears, and at which axis position it appears within those input tensors.
- splittable axis 1 appears in input tensors 1 and 2 of the first operator, appearing at axis 0 of input tensor 1 and at axis 0 of input tensor 2.
- the number of input tensors included in each of the K sets of input tensors is the same as the number of input tensors included in the first operator.
- the input tensor of the first operator is segmented to obtain K groups of input tensors, where M is a positive integer greater than or equal to 2.
- the graph optimizer does not necessarily need to use all computing resources.
- the required quantity K of target computing resources can be estimated, or the quantity K of target computing resources can be determined randomly; how K is determined is not limited in this embodiment of the present application.
- each group of input tensors in the K groups is the set of input tensors required by one target computing resource. For example, if a single computing resource used to complete the computing task requires a input tensors, then after the input tensors of the first operator are split, each target computing resource used to complete the computing task also requires a input tensors.
- the target splitting axis is determined according to the splitting information of the first operator, and the input tensor of the first operator is split according to the target splitting axis to obtain K groups of input tensors.
- The flow of the corresponding operator segmentation method in which a single operator completes a computing task will be described in detail later with reference to FIG. 4.
- Splitting the input tensors of the first operator does not mean splitting all of the input tensors of the first operator, but only splitting the input tensors that include the target splitting axis; input tensors that do not include the target splitting axis are sent to each target computing resource as shared input data.
- a space of candidate segmentation methods is determined according to the segmentation information of the first operator, the segmentation information of the second operator, and the quantity M of computing resources; a target segmentation method is then determined from the candidate segmentation method space, and the input tensors of the first operator are segmented according to the target segmentation method to obtain K groups of input tensors.
- the process of using multiple operators to complete the corresponding segmentation method for computing tasks will be described in detail later in conjunction with FIG. 5 .
- the quantity K of target computing resources is determined according to the quantity M of computing resources.
- the graph optimizer obtains the operator segmentation information from the operator segmentation information database. Since the segmentation information of each operator can be obtained directly from the operator segmentation information database, the graph optimizer does not need to perceive the mathematical semantics or underlying implementation of each operator at all, and can automatically segment the input tensors of the operator, thereby completely decoupling graph optimization from operator optimization.
- FIG. 4 is a schematic flowchart of a corresponding operator segmentation method for completing a computing task with a single operator provided in an embodiment of the present application.
- FIG. 4 is a specific description of a possible implementation of S303.
- the graph optimizer randomly selects a splittable axis as the target splitting axis; for example, the first axis of the input tensor of the first operator, which can be a batch axis, is used as the target splitting axis.
- the graph optimizer selects the splittable axis that appears in the most input tensors of the first operator as the target splitting axis. For example, if the first operator has 3 input tensors, splittable axis 1 appears in 3 of the input tensors, and splittable axis 2 appears in 2 of the input tensors, then splittable axis 1 can be used as the target splitting axis.
- the splittable axis with the shortest computing time is selected as the target splitting axis.
- the target splitting axis is determined according to the computing time required to complete the computing task with the splitting method corresponding to each splittable axis and the corresponding quantity of target computing resources. For example, if completing the computing task with the splitting method corresponding to splittable axis 1 and b target computing resources takes the same computing time as completing it with the splitting method corresponding to splittable axis 2 and c target computing resources, but the quantity b of target computing resources corresponding to splittable axis 1 is less than the quantity c corresponding to splittable axis 2, then splittable axis 1 is selected as the target splitting axis, and the quantity of target computing resources corresponding to splittable axis 1 is b.
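The selection rule just described can be sketched as follows, assuming each candidate is a (splittable axis, target resource count, computing time) tuple; the names and tuple layout are hypothetical:

```python
# Hypothetical candidates: (axis name, target resource count, time).
candidates = [
    ("axis_1", 2, 10.0),  # b = 2 resources, same computing time
    ("axis_2", 4, 10.0),  # c = 4 resources, same computing time
]

def pick_target_axis(cands):
    """Prefer the shortest computing time; break ties with fewer
    target computing resources, as described in the text."""
    return min(cands, key=lambda c: (c[2], c[1]))

axis, k, _ = pick_target_axis(candidates)
print(axis, k)  # axis_1 2
```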
- According to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the input tensors of the first operator are segmented to obtain K groups of input tensors.
- According to the segmentation method corresponding to the axis type of the target segmentation axis in the first operator, the Q first input tensors that include the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors are determined, where Q is a positive integer greater than or equal to 1.
- the Q first input tensors are input tensors of the first operator including the target split axis.
- the position of the target splitting axis in each of the Q first input tensors indicates which axis the target splitting axis is in each first input tensor; for example, the target splitting axis is on axis 0 of the 1st first input tensor and on axis 0 of the 2nd first input tensor.
- each of the Q first input tensors is respectively segmented, Q groups of second input tensors are obtained, wherein each group of second input tensors in the Q groups of second input tensors includes K second input tensors.
- each first input tensor that includes the target splitting axis is split along the target splitting axis into K second input tensors, and the K second input tensors are respectively used as the input tensors of the K target computing resources. Therefore, when there are Q first input tensors that include the target splitting axis, Q groups of second input tensors are formed.
- the axis type of the target segmentation axis can be an elementwise axis, a sliding window axis, or a reduction (reduce) axis; the segmentation methods of these axis types will be described in detail below in conjunction with Fig. 7 to Fig. 13.
- K sets of input tensors are obtained according to Q sets of second input tensors and unsegmented input tensors of the first operator.
- the kth group of input tensors in the K group of input tensors includes the kth second input tensor in each group of second input tensors in the Q group of second input tensors and the undivided input tensor of the first operator.
- that is, each set of input tensors in the K groups includes the unsegmented input tensors of the first operator as shared data, and the second input tensors, obtained by segmenting the first operator's input tensors, that correspond to the respective target computing resource.
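The assembly of the K groups from split and shared input tensors can be sketched with NumPy; the helper name `make_input_groups` and the boolean flag list are hypothetical:

```python
import numpy as np

# Sketch: split each input tensor that contains the target axis into
# K pieces, and send every other input tensor to all K resources as
# shared data.
def make_input_groups(inputs, has_target_axis, axis, K):
    groups = [[] for _ in range(K)]
    for tensor, splittable in zip(inputs, has_target_axis):
        if splittable:
            pieces = np.array_split(tensor, K, axis=axis)
            for k in range(K):
                groups[k].append(pieces[k])   # per-resource piece
        else:
            for k in range(K):
                groups[k].append(tensor)      # shared input data
    return groups

a = np.arange(8).reshape(8, 1)   # contains the target axis (axis 0)
b = np.ones((3,))                # does not; shared by every resource
groups = make_input_groups([a, b], [True, False], axis=0, K=2)
print(len(groups), groups[0][0].shape)  # 2 (4, 1)
```

Note that each group has the same number of input tensors as the original operator, as the text requires.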
- FIG. 5 is a schematic flowchart of another method for processing computing tasks provided by the embodiment of the present application.
- FIG. 5 is a specific description of another possible implementation of S303.
- the operator for executing the calculation task further includes a second operator
- the second operator includes P slicable axes
- the P slicable axes are a subset of the N slicable axes.
- That the P splittable axes are a subset of the N splittable axes means that the P splittable axes appear in the output tensor of the first operator, and the output tensor of the first operator is used as the input tensor of the second operator. That is, the P splittable axes of the second operator also appear among the N splittable axes of the first operator.
- P pieces of segmentation reference information are determined, and the p-th piece of segmentation reference information among the P pieces includes: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensors of the first operator.
- the segmentation methods included in the p-th group of candidate segmentation methods are determined according to the p-th segmentation reference information among the P segmentation reference information and the amount M of computing resources.
- each group of candidate segmentation methods is a candidate segmentation method corresponding to each segmentation reference information, that is, the segmentation reference information corresponding to each slicable axis among the P slicable axes.
- Each group of candidate segmentation methods includes at least one segmentation method, which can also be understood as each group of candidate segmentation methods including M-1 segmentation methods. For example, when the quantity of computing resources is 4, the quantity of target computing resources can be 2, 3, or 4, i.e., there are 3 possible quantities of target computing resources; therefore, each group of candidate segmentation methods includes 3 segmentation methods.
- the segmentation method that takes the shortest time to complete the computing task among the P group of candidate segmentation methods is determined as the target segmentation method.
- the traversal method can be through simulation, theoretical calculation, or running on actual hardware. The embodiment of the present application does not limit the traversal method.
- the target segmentation method is searched for among the P groups of candidate segmentation methods, where multiple search methods are possible, such as a Markov chain Monte Carlo algorithm or a genetic algorithm.
- the embodiment of the present application does not limit the search method.
- the target segmentation method is determined according to the time each segmentation method in the P groups of candidate segmentation methods requires to complete the computing task and the corresponding quantity of target computing resources. For example, if completing the computing task with segmentation method 1 and d target computing resources takes the same computing time as completing it with segmentation method 2 and e target computing resources, but the quantity d of target computing resources corresponding to segmentation method 1 is less than the quantity e corresponding to segmentation method 2, then segmentation method 1 is selected as the target segmentation method, and the quantity of target computing resources corresponding to the target segmentation method is d.
- FIG. 6 is a schematic flowchart of a corresponding operator segmentation method for multi-operators to complete computing tasks provided by an embodiment of the present application. S505 will be specifically described below in conjunction with FIG. 6 .
- According to the target segmentation method, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the Q first input tensors of the first operator that include the target segmentation axis are determined.
- Each of the Q first input tensors is segmented to obtain Q groups of second input tensors, where each group of second input tensors in the Q groups includes K second input tensors, and the q-th group of second input tensors is obtained by segmenting the q-th first input tensor.
- the kth group of input tensors in the K group of input tensors includes the kth second input tensor in each group of second input tensors in the Q group of second input tensors and the undivided input tensor of the first operator.
- The above takes a computing task completed by the first operator and the second operator as an example for detailed description.
- the graph optimizer also needs to obtain the segmentation information of other operators, so as to obtain candidate segmentation methods, so as to determine the target segmentation method.
- The segmentation information in the embodiment of this application includes the axis types of the splittable axes of the operator's input tensors and the position information of the splittable axes in the operator's input tensors and output tensors; the segmentation methods corresponding to operators with different axis types will be described in detail below in conjunction with Fig. 7 to Fig. 17.
- The axis type reflects the data dependency between the input tensors and the output tensor of the operator; that is, the graph optimizer can determine the splitting method corresponding to an axis type from the axis type of the input tensor alone. Therefore, when the inputs of different operators include the same axis type, they can share the same operator splitting method.
- the axis types of the operator input tensors may include splittable axes such as elementwise axes, reduction axes, and sliding window axes, and may also include other types of splittable axes; this is not limited in the embodiment of the present application.
- Figure 7 to Figure 13 are schematic diagrams of operator segmentation methods in which a single operator completes a computing task. Operator A, operator B, and operator C in Figure 7 to Figure 13 can all represent the first operator; the embodiment of this application does not limit the name of the first operator.
- Element (elementwise) axis: if an iteration variable in the input tensor of operator A is an elementwise axis, then the elementwise axis is an axis on which the elements of the input tensors and the output tensor of operator A have a point-to-point mapping relationship; that is, each point of the output tensor and the points of the input tensors that it depends on lie on the same axis.
- For example, the input tensor is a 4D tensor of shape (5,7,9,3), where axis 3 of the input tensor has length 3 and includes the data a0, a1, and a2; the shape of the output tensor is (4,6,8,3), where axis 3 of the output tensor has length 3 and includes the data b0, b1, and b2. If the positions of a0 and b0 correspond, the positions of a1 and b1 correspond, and the positions of a2 and b2 correspond, then the axis type of axis 3 of the input tensor and the output tensor is the elementwise axis.
- FIG. 7 is a schematic diagram of a splitting method of an element axis provided in an embodiment of the present application.
- the steps of splitting the input tensor of operator A according to the element axis are shown in Figure 7.
- operator A is used as an example of an activation function operator for illustration.
- the embodiment of this application does not limit the type of operator A.
- the input tensor and output tensor of the activation function operator in Figure 7 are described by taking a single input tensor and a single output tensor as an example; the embodiment of the present application does not limit the number of input and output tensors.
- According to the segmentation information of the activation function operator, the type of the target segmentation axis in the activation function operator is the elementwise axis, and according to the position information of the target segmentation axis in the activation function operator, it can be determined that the elementwise axis appears on axis 0 of the first input tensor of the activation function operator, i.e., the axis of length 8 is the elementwise axis.
- the forward derivation logic of the element axis is that the lengths of the element axes of the first input tensor and the first output tensor are equal.
- For example, the first input tensor of the activation function operator is (8,56,56,64), and axis 0 of the first input tensor is the elementwise axis. According to the logic that the elementwise axis of the output tensor is equal in length to the elementwise axis of the input tensor, the length of axis 0 of the first output tensor is also 8; that is, the first output tensor is (8,56,56,64).
- The first output tensor is segmented along the elementwise axis to obtain the second output tensor of the activation function operator on each target computing resource. Then, according to the length of the elementwise axis of the second output tensor of the operator on each computing resource, the elementwise-axis length of each second input tensor is reversely deduced through the reverse shape derivation function of the elementwise axis.
- Each second input tensor is obtained by splitting the first input tensor with the split function, and the second output tensors on the different target computing resources need to be spliced through the concat function to obtain the first output tensor.
- The axis-0 length of the first output tensor is 8, so the elementwise-axis length of the second output tensor on each computing resource is 4; that is, the 1st second output tensor is (4,56,56,64) and the 2nd second output tensor is (4,56,56,64). The concatenation function is used between the second output tensors and the first output tensor to synchronize the data. There is no intersection between the elements on axis 0 of the 1st second output tensor and the elements on axis 0 of the 2nd second output tensor.
- The elementwise-axis length of each second input tensor is reversely deduced to be 4; that is, the 1st second input tensor is (4,56,56,64) and the 2nd second input tensor is (4,56,56,64). Therefore, to obtain second input tensors of this shape, the graph optimizer splits the first input tensor along the elementwise axis, i.e., along axis 0 of the first input tensor, by calling the first slicing function, obtaining two second input tensors.
- the lengths of the 0-axis in the two second output tensors shown in (b) of FIG. 7 are equal, and the lengths of the 0-axis in the two second input tensors are also equal.
- In practice, the second input tensors only need to meet the following conditions: the elements corresponding to axis 0 of the two second input tensors do not overlap, that is, the elements corresponding to axis 0 of each second input tensor are a subset of the elements corresponding to axis 0 of the first input tensor, the two subsets have no intersection, and the union of the elements corresponding to axis 0 of the two second input tensors is the set of elements corresponding to axis 0 of the first input tensor.
- the embodiment of the present application does not limit whether the lengths corresponding to the 0-axis of the second input tensor obtained on different computing resources are equal.
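The elementwise-axis property can be checked numerically. The sketch below uses NumPy, with ReLU standing in for the activation function operator (the specific operator is an assumption; any elementwise operator behaves the same way):

```python
import numpy as np

# Element-axis property: splitting the input along the elementwise
# axis, applying the operator per piece, and concatenating the pieces
# (the concat synchronization step) equals applying the operator to
# the whole tensor.
relu = lambda t: np.maximum(t, 0)

x = np.random.randn(8, 56, 56, 64)
whole = relu(x)

pieces = np.split(x, 2, axis=0)        # two (4, 56, 56, 64) tensors
stitched = np.concatenate([relu(p) for p in pieces], axis=0)

print(np.allclose(whole, stitched))  # True
```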
- Reduction (reduce) axis: if an iteration variable in the input tensor of operator B is a reduction axis, then the reduction axis is an axis that exists in the input tensor of the operator but either does not exist in the output tensor of the operator or has length 1 there.
- the reduction axis can be further divided into two types.
- the first type of reduction axis is the reduction axis on which the operator B performs a reduction operation on the elements in the input tensor.
- For example, the shape of the input tensor of operator B is (2,3,4,5), where axis 0 of the input tensor is the reduction axis with length 2; after the input tensor is operated on by operator B, the shape of the output tensor obtained is (,3,4,5) or (1,3,4,5).
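This shape behavior can be reproduced with NumPy, where `keepdims` distinguishes the (1,3,4,5) form from the (,3,4,5) form:

```python
import numpy as np

# A reduction axis is present in the input but absent (or length 1)
# in the output. Summing over axis 0 of a (2, 3, 4, 5) tensor:
x = np.zeros((2, 3, 4, 5))
print(x.sum(axis=0).shape)                 # (3, 4, 5)
print(x.sum(axis=0, keepdims=True).shape)  # (1, 3, 4, 5)
```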
- The second type of reduction axis is a reduction axis on which operator B does not perform a reduction operation on the elements in the input tensor. Although operator B performs no reduction operation on the elements on the second type of reduction axis, this axis still does not appear in the output tensor, while it does appear in the input tensor.
- The reduce-gather axis will be described in detail in conjunction with Figure 11.
- The first type of reduction axis may include a reduction sum (reduceSum) axis, a reduction maximum value (reduceMax) axis, a reduction minimum value (reduceMin) axis, a reduction mean value (reduceMean) axis, and the like. It should be noted that these different reduction axes all share the general characteristics of reduction axes; the difference lies in the type of function that needs to be called so that the split first input tensors, after passing through operator B on the different target computing resources, yield a first output tensor equivalent to the one before splitting. The specific segmentation methods of the different first-type reduction axes will be described in detail below in conjunction with FIG. 8 to FIG. 10.
- FIG. 8 is a schematic diagram of a splitting method of a reduced sum axis provided by an embodiment of the present application.
- the steps of splitting the reduced sum axis in the first input tensor of operator B are shown in FIG. 8 .
- Operator B is taken as a reduced-sum (reduceSum) operator as an example for illustration.
- the embodiment of this application does not limit the type of operator B.
- the input tensor and output tensor of the reduced-sum operator in Figure 8 are described by taking a single input tensor and a single output tensor as an example; the embodiment of the present application does not limit the number of input and output tensors.
- According to the segmentation information of the reduced-sum operator, the type of the target split axis in the reduced-sum operator is the reduced-sum axis, and according to the position information of the target split axis in the reduced-sum operator, it can be determined that the reduced-sum axis appears on axis 0 of the first input tensor of the reduced-sum operator. For example, the first input tensor of the reduced-sum operator is (8,56,56,64), and the axis type of axis 0 of the first input tensor is the reduced-sum axis; that is, the axis of length 8 is the reduced-sum axis. Therefore, according to the characteristics of the reduced-sum axis, the length of the reduced-sum axis of the first output tensor is 1; that is, the first output tensor is (,56,56,64).
- The first input tensor is divided into two second input tensors along the reduced-sum axis, and the two second input tensors are sent to two target computing resources for operation, obtaining two second output tensors. The 1st second output tensor is (,56,56,64) and the 2nd second output tensor is (,56,56,64); the first output tensor is obtained by synchronizing the data on the two target computing resources by calling the addition (AddN) function.
- The length of the reduced-sum axis of the first output tensor is 1, and the second output tensors of the operator on the computing resources are added to obtain the first output tensor; the shape of each second output tensor is the same as that of the first output tensor, namely (,56,56,64). The reduced-sum axis of the first input tensor is split by the slicing function to obtain the second input tensors, where the length of the reduced-sum axis of each second input tensor is 4; that is, the shape of each second input tensor is (4,56,56,64).
- the lengths of the 0-axis in the two second output tensors shown in (b) of FIG. 8 are equal, and the lengths of the 0-axis in the two second input tensors are also equal.
- In practice, the second input tensors only need to meet the following conditions: the elements corresponding to axis 0 of the two second input tensors do not overlap, that is, the elements corresponding to axis 0 of each second input tensor are a subset of the elements corresponding to axis 0 of the first input tensor, the two subsets have no intersection, and the union of the elements corresponding to axis 0 of the two second input tensors is the set of elements corresponding to axis 0 of the first input tensor.
- the embodiment of the present application does not limit whether the lengths corresponding to the 0-axis of the second input tensor obtained on different computing resources are equal.
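The reduced-sum equivalence can be checked numerically with NumPy, with ordinary addition standing in for the AddN synchronization function:

```python
import numpy as np

# Reduced-sum axis property: summing each piece separately and then
# adding the partial results (the AddN synchronization step) equals
# summing over the whole axis at once.
x = np.random.randn(8, 56, 56, 64)
whole = x.sum(axis=0)

pieces = np.split(x, 2, axis=0)            # two (4, 56, 56, 64) pieces
partial = [p.sum(axis=0) for p in pieces]  # per-resource second outputs
synced = partial[0] + partial[1]           # AddN synchronization

print(np.allclose(whole, synced))  # True
```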
- FIG. 9 is a schematic diagram of a splitting method of a reduced maximum value axis provided by an embodiment of the present application. The steps of dividing the reduced maximum value axis of the first input tensor of operator B are shown in FIG. 9 .
- operator B is used as an example for illustration.
- the embodiment of this application does not limit the type of operator B.
- The input tensor and output tensor of the reduced-maximum operator in Figure 9 are described by taking a single input tensor and a single output tensor as an example; the embodiment of the present application does not limit the number of input and output tensors. The main steps are generally the same as the splitting of the first input tensor of the reduced-sum operator along the reduced-sum axis in Figure 8. The difference is the type of function that needs to be called after splitting the first input tensor along the reduced-sum axis versus along the reduced-maximum axis: the second output tensors obtained by performing the reduced-sum operator operation on the second input tensors on the different target computing resources are synchronized by calling the addition function to obtain the first output tensor, whereas the second output tensors obtained by performing the reduced-maximum operator operation on the second input tensors are synchronized by calling the maximum value function to obtain the first output tensor.
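The same numerical check for the reduced-maximum axis, with NumPy's elementwise maximum standing in for the maximum value synchronization function:

```python
import numpy as np

# For a reduced-max axis only the synchronization function changes:
# per-piece maxima are merged with an elementwise maximum, not a sum.
x = np.random.randn(8, 56, 56, 64)
whole = x.max(axis=0)

pieces = np.split(x, 2, axis=0)
synced = np.maximum(pieces[0].max(axis=0), pieces[1].max(axis=0))

print(np.allclose(whole, synced))  # True
```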
- Fig. 10 is a schematic diagram of a splitting method of a reduced-mean axis provided by an embodiment of the present application.
- the steps of splitting the input tensor of operator B according to the reduced mean axis are shown in Figure 10.
- Operator B is taken as a reduced-mean (reduceMean) operator as an example for illustration.
- the embodiment of this application does not limit the type of operator B.
- the input tensor and output tensor of the reduced-mean operator in Figure 10 are described by taking a single input tensor and a single output tensor as an example; the embodiment of the present application does not limit the number of input and output tensors.
- The addition function is a synchronization node that sums the second output tensors of the different computing resources along the reduced-mean axis, and the multiplication function multiplies the summed intermediate output tensor by 1/group along the reduced-mean axis to obtain the first output tensor, where group is the quantity of target computing resources; for example, in Figure 10 there are 2 target computing resources, so group is 2.
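The reduced-mean steps (per-resource mean, addition-function synchronization, multiplication by 1/group) can be verified with NumPy; equal piece sizes are assumed, as in the Fig. 10 example, since otherwise the uniform 1/group scaling would not be exact:

```python
import numpy as np

group = 2                       # quantity of target computing resources
x = np.random.randn(8, 56, 56, 64)
whole = x.mean(axis=0)

pieces = np.split(x, group, axis=0)          # equal (4, 56, 56, 64) pieces
partial = [p.mean(axis=0) for p in pieces]   # per-resource reduceMean
# Addition function (sync node), then multiplication by 1/group:
synced = (partial[0] + partial[1]) * (1.0 / group)

print(np.allclose(whole, synced))  # True
```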
- the second type of reduction axis includes the reduce-gather axis.
- Operator B indexes data among the elements of its first input tensor according to the addresses indicated by the elements of its index input tensor; that is, when the first input tensor contains a reduce-gather axis, the data at the addresses given by the first index input tensor (indice) must be found along the reduce-gather axis of the first input tensor and used as the data on axis 0 of the first output tensor.
- Fig. 11 is a schematic diagram of a splitting method of the reduce-gather axis provided by an embodiment of the present application.
- the operator B is taken as the gather (gather2) operator as an example to describe in detail.
- The embodiment of the application does not limit the type of operator B. It should be noted that the input tensors of the gather operator in Figure 11 are an index input tensor and a first input tensor, and the output tensor of the gather operator is explained by taking a single first output tensor as an example; the embodiment of the present application does not limit the number of first input tensors and first output tensors of the operator.
- the gather operator has two input tensors, namely the first input tensor and the first index input tensor, where the first input tensor is the data input tensor with shape (80,64), and the first index input tensor is the input tensor containing the index addresses, with shape (20,).
- According to the segmentation information of the gather operator, the target segmentation axis is determined to be the reduce-gather axis, and the reduce-gather axis appears on axis 0 of the first input tensor.
- The reduce-gather axis does not appear in the first output tensor; axis 0 of the first output tensor corresponds to the first index input tensor. The first input tensor is segmented along the reduce-gather axis by calling the third segmentation function to obtain two second input tensors.
- Each target computing resource has a corresponding second input tensor and an index input tensor obtained by offsetting the first index input tensor through a call to a bias function. Each target computing resource then passes through the gather operator to obtain its own second output tensor.
- the second output tensors on different target computing resources are added and data synchronized to obtain the first output tensor.
- Since the first output tensor has no reduce-gather axis, the length of axis 0 of the first output tensor is equal to the length of axis 0 of the first index input tensor. Since there are two computing resources, the second input tensors are obtained by calling the third slicing function to split the first input tensor along the reduce-gather axis, where the length of the reduce-gather axis of each second input tensor is 40; that is, the shape of the 1st second input tensor is (40,64), and the shape of the 2nd second input tensor is also (40,64).
- When each computing resource performs the gather operator operation, it obtains the same first index input tensor. Since the gather operator on each computing resource only holds half of the data along the reduce-gather axis of the first input tensor, namely the 1st second input tensor or the 2nd second input tensor, the first index input tensor needs to pass through the bias operator operation to ensure the correctness of the second output tensor obtained after the gather operator operation on each computing resource.
- On each computing resource, where an address in the first index input tensor falls outside the gather operator's second input tensor, 0 is used as the lookup result; finally, the second output tensors of the gather operator operations on the two computing resources are subjected to the addition operator operation to obtain the first output tensor.
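The reduce-gather scheme (split the data tensor, bias the shared index tensor, zero-fill out-of-range lookups, add the per-resource results) can be checked with NumPy; fancy indexing stands in for the gather operator:

```python
import numpy as np

data = np.random.randn(80, 64)              # first input tensor (80, 64)
idx = np.random.randint(0, 80, size=(20,))  # first index input tensor
whole = data[idx]                           # full gather, shape (20, 64)

pieces = np.split(data, 2, axis=0)          # two (40, 64) second inputs
outs = []
for r, piece in enumerate(pieces):
    local = idx - r * 40                    # bias the shared index tensor
    in_range = (local >= 0) & (local < 40)
    out = np.zeros((20, 64))
    out[in_range] = piece[local[in_range]]  # out-of-range rows stay 0
    outs.append(out)                        # per-resource second output

# Addition operator over the two second output tensors:
print(np.allclose(whole, outs[0] + outs[1]))  # True
```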
- Since the type of the reduction axis already determines the specific segmentation method, the graph optimizer, when performing graph optimization, does not need to rely on the principle of any specific operator and can reasonably segment the input tensors of an operator that includes a reduction axis. The characteristic of the reduction axis is that it does not appear in the output tensor, or its length in the output tensor is 1; therefore, the traditional operator splitting method cannot split an axis of the input tensor that has the characteristics of the reduction axis.
- Sliding window axis: if an iteration variable in the input tensor of operator C is a sliding window axis, then the sliding window axis is an axis on which operator C performs a sliding-window scanning operation over the elements of the input tensor of operator C; if the sliding window is larger than the stride, the windows of every two adjacent scans overlap.
- The first output tensor is divided along the sliding window axis; when there are two target computing resources, the elements corresponding to the sliding window axis in the first output tensor are divided equally. After this equal division, some data on the sliding window axes of the two parts of the first output tensor may depend on the same data on the sliding window axis of the first input tensor. Therefore, there are two segmentation methods for splitting a first input tensor that includes a sliding window axis, which will be specifically described in conjunction with FIG. 12 and FIG. 13.
- The forward shape derivation function for the sliding window axis gives the length y of the sliding window axis of the output tensor from the length x of the sliding window axis of the first input tensor, that is, y = f2(x), where f2() is determined by the convolution padding value, convolution kernel size, convolution stride, and convolution kernel dilation coefficient.
- The inverse shape derivation function for the sliding window axes of the first input tensor and the first output tensor performs the reverse derivation: from the length of the sliding window axis in the first output tensor, an appropriate splitting method is determined to obtain the second output tensor and the second input tensor of each computing resource.
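The forward and inverse shape derivation can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names `fwd_len`/`inv_len` and the assumption that `pad` is the total (left plus right) padding are ours; the formula is the standard convolution output-length relation.

```python
def fwd_len(x, k, s, pad=0, d=1):
    # forward shape derivation f2(): length of the sliding-window axis of the
    # output, given input length x, kernel size k, stride s, total padding pad,
    # and dilation coefficient d
    return (x + pad - d * (k - 1) - 1) // s + 1

def inv_len(y, k, s, d=1):
    # inverse shape derivation: unpadded input length needed to produce y
    # output positions along the sliding-window axis
    return (y - 1) * s + d * (k - 1) + 1
```

With the numbers of the example below (stride 2, kernel 3), a padded input axis of length 56 yields an output axis of length 28, and reversing an output half of length 14 requires an input slice of length 29.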
- Fig. 12 is a schematic diagram of a sliding window axis splitting method provided by an embodiment of the present application.
- The method of splitting the input tensor of operator C with overlap according to the sliding window axis is shown in Figure 12. A convolution operator is used as operator C for illustration; the embodiment of this application does not limit the type of operator C. For ease of description, the convolution operator in Figure 12 is illustrated with a single input tensor and a single output tensor; the number of input and output tensors is not limited.
- The type of the target segmentation axis in the convolution operator is a sliding window axis; that is, the target segmentation axis is a sliding window axis that appears in the convolution operator.
- The 1-axis of the first input tensor, that is, the axis of length 56, is the sliding window axis. Therefore, the length of the sliding window axis in the first output tensor is derived forward from the forward shape derivation function of the sliding window axis and the length of the sliding window axis in the first input tensor of the convolution operator. For example, if the first input tensor of the operator has shape (1,56,56,64), then according to the forward shape derivation function for the sliding window axes of the first input tensor and the first output tensor, with a convolution stride of 2 and a convolution kernel size of 3, the first output tensor is obtained as (1,28,56,64).
- The first output tensor is divided according to the sliding window axis to obtain K second output tensors. Then, according to the inverse shape derivation function of the sliding window axis and the length of the sliding window axis in each second output tensor, the length of the sliding window axis of the second input tensor of the convolution operator on each target computing resource is derived in reverse, and the first input tensor is sliced along the sliding window axis by calling the first slicing function to obtain the second input tensors. After the second output tensors are computed on the different target computing resources, the first output tensor equivalent to the operation before splitting can be obtained by calling the splicing function.
- The 1-axis of the first output tensor is the sliding window axis, with length 28, so the 1-axis length of the second output tensor on each computing resource is 14. Then, following the inverse shape derivation logic of the sliding window axis, since the convolution stride is 2 and the kernel size is 3, the 1-axis length of the second input tensor on each computing resource is 29. The first input tensor is therefore sliced along the 1-axis to obtain two second input tensors whose 1-axis length is 29. These two second input tensors contain overlapping data: the 1-axis data range of one second input tensor is positions 0 to 28 of the 1-axis of the first input tensor, that of the other is positions 28 to 56, and the 29th element on the 1-axis of the first input tensor is the overlapping part of the two second input tensors.
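The overlapping split of Figure 12 can be sketched with NumPy. This is an illustrative sketch under assumptions of our own: the helper name `split_overlap` is hypothetical, no padding is applied, and the input axis length is taken as 57 so that the unpadded arithmetic reproduces the 28/14/29 lengths of the padded length-56 example in the text.

```python
import numpy as np

def split_overlap(x, axis, k, s, parts):
    # Overlapping split of the sliding-window axis: each slice holds exactly
    # the input elements that its share of the output positions depends on.
    y = (x.shape[axis] - k) // s + 1          # forward shape derivation (no padding)
    per = y // parts                          # output positions per computing resource
    out = []
    for i in range(parts):
        start = i * per * s                   # first input element needed
        stop = start + (per - 1) * s + k      # inverse shape derivation
        idx = [slice(None)] * x.ndim
        idx[axis] = slice(start, stop)
        out.append(x[tuple(idx)])
    return out
```

For two computing resources the two slices have length 29 each and share exactly one element at the boundary, matching the overlapping part described above.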
- The overlapping segmentation method shown in Figure 12 is suitable for scenarios in which the split input tensors are operated on by operators on different computing resources without requiring frequent data synchronization: the computations are completely independent and can run in parallel. However, in some scenarios the split input tensors must be synchronized frequently after being computed on different computing resources, which causes the overlapping parts of the output tensors obtained on different computing resources to be spliced frequently, so that the overlapping portion keeps growing and produces unnecessary repeated computation. Therefore, the embodiment of this application also provides another splitting method in which the sliding window axes do not overlap, as shown in FIG. 13.
- Fig. 13 is a schematic diagram of another sliding window axis splitting method provided by an embodiment of the present application. The steps of splitting the sliding window axis of the input tensor of operator C without overlap are shown in Figure 13. A convolution operator is again used as operator C for illustration, and the convolution operator in Figure 13 is described with a single input tensor and a single output tensor as an example; the number of input and output tensors is not limited.
- The step in FIG. 13 of deriving the length corresponding to the sliding window axis in the second input tensor of each target computing resource is the same as the overlapping segmentation step in FIG. 12; the difference from FIG. 12 lies in the process of splitting the first input tensor to obtain the K second input tensors.
- The second slicing function in (b) of Figure 13 is used to divide the first input tensor equally along the sliding window axis, while second slicing function 1 and second slicing function 2 are used to obtain the overlapping sliding-window-axis data on which the second output tensors computed by the operator on different target computing resources jointly depend.
- First splicing function 1 and first splicing function 2 are used to splice, along the sliding window axis, the third input tensors produced by the second slicing function with the fourth input tensors produced by second slicing function 1 and second slicing function 2, so as to obtain the second input tensor for each target computing resource; the second splicing function is used to splice the second output tensors computed by the convolution operator on different target computing resources to obtain the first output tensor.
- For example, the first input tensor is divided equally along the 1-axis to obtain two equally divided third input tensors, each of shape (1,28,56,64): the 1st third input tensor and the 2nd third input tensor.
- The 1st third input tensor is sliced along the 1-axis to obtain the 2nd fourth input tensor, of shape (1,1,56,64); its data on the 1-axis is the last element of the 1st third input tensor on the sliding window axis, that is, the 28th element of the first input tensor on the 1-axis. Likewise, the 2nd third input tensor is sliced along the 1-axis to obtain the 1st fourth input tensor, of shape (1,1,56,64); its data on the 1-axis is the first element of the 2nd third input tensor on the sliding window axis, that is, the 29th element of the first input tensor on the 1-axis.
- The 1st third input tensor and the 1st fourth input tensor are spliced along the 1-axis to obtain the 1st second input tensor, of shape (1,29,56,64), whose 1-axis data range is positions 0 to 28 of the 1-axis of the first input tensor; the 2nd fourth input tensor and the 2nd third input tensor are spliced along the 1-axis to obtain the 2nd second input tensor, of shape (1,29,56,64), whose 1-axis data range is positions 28 to 56 of the 1-axis of the first input tensor.
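The non-overlapping storage scheme of Figure 13 (equal split plus spliced boundary elements) can be sketched as follows; the helper name `split_no_overlap`, the restriction to two halves, and the single-element halo are illustrative assumptions of ours.

```python
import numpy as np

def split_no_overlap(x, axis, halo):
    # Equal split along the sliding-window axis ("third input tensors"),
    # then splice in `halo` boundary elements taken from the neighbouring
    # half ("fourth input tensors") to form the second input tensors.
    n = x.shape[axis]
    lo = np.take(x, range(0, n // 2), axis=axis)                    # 1st third tensor
    hi = np.take(x, range(n // 2, n), axis=axis)                    # 2nd third tensor
    lo_halo = np.take(hi, range(0, halo), axis=axis)                # 1st fourth tensor
    hi_halo = np.take(lo, range(n // 2 - halo, n // 2), axis=axis)  # 2nd fourth tensor
    return (np.concatenate([lo, lo_halo], axis=axis),
            np.concatenate([hi_halo, hi], axis=axis))
```

Because each half is stored without overlap and the halo is exchanged only at splice time, repeated computation of the overlapping data is avoided.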
- The non-overlapping sliding-window-axis splitting is suitable for scenarios that require frequent data synchronization between different computing resources, for example multi-die parallelism, with the splicing function serving as the link between different dies. In this way, overlapping data is neither computed repeatedly nor allowed to keep growing, which effectively reduces the computing pressure and storage pressure on the computing resources.
- In this way, by performing single-operator segmentation according to the types of the different axes in an operator's input tensors and the segmentation method corresponding to each axis type, the graph optimizer can automatically obtain different single-operator segmentation strategies without relying on the principles of specific operators, thereby realizing the complete decoupling of the graph optimizer and the operator optimization module.
- tensor axes are not limited to those listed in the embodiment of the present application, and there may be other tensor axes and corresponding operator segmentation methods, which are not limited in the embodiments of the present application.
- The computing resource can be a GPU, CPU, bare die, or chip, etc. The embodiment of the present application limits neither the type of computing resource nor the number of computing resources; the two computing resources in the embodiment of the present application are just one example.
- the graph optimizer automatically splits the operator input and output tensors according to different types of axes.
- It is not necessary to split the input and output tensors based on the principles of specific operators; the input and output tensors only need to be split according to the operator splitting methods corresponding to the different types of axes. Moreover, splitting the input and output tensors of an operator does not change the operator's calculation formula, only some of its parameters. This completely decouples graph optimization from the principles of specific operators, and the segmentation method for the first input tensor of an operator based on different axis types therefore generalizes more strongly.
- the above content is a specific description of the splitting method of the input tensor and the output tensor of a single operator.
- The position information of the splittable axes in the input and output tensors of an operator determines its segmentation method, and when multiple operators are required to complete a computing task, the position information of the splittable axes in the input and output tensors of the multiple operators makes it possible for the graph optimizer to cascade the segmentation methods of different operators into subgraphs.
- the position information of the splittable axis of the operator in the input tensor and the output tensor of the operator will be described in detail below with reference to FIG. 14 .
- Fig. 14 is a schematic diagram of position information of an operator's splittable axis in the input tensor and output tensor of the operator provided by the embodiment of the present application.
- The position information of an operator's splittable axis in the input and output tensors indicates which input tensors and which output tensors the same splittable axis appears on, and the specific location of that splittable axis in each of those tensors.
- the type of each divisible axis is one of the above-mentioned different types of axes.
- The embodiment of the present application does not limit the number of input tensors and output tensors of an operator; multiple input tensors can be operated on by the operator to obtain multiple output tensors, and there is no limit on the number of tensor axes in the input tensors and output tensors.
- For example, the 0-axis, 1-axis and 2-axis of the first feature-map input tensor are sliding window axes, and its 3-axis is a reduction axis; the 0-axis of the first weight input tensor is an element axis, and its 1-axis, 2-axis and 3-axis are reduction axes. Since a reduction axis does not appear on the output tensor, the tensor axes that appear in the first output tensor are the 0-axis, 1-axis and 2-axis of the first feature-map input tensor and the 0-axis of the first weight input tensor; therefore, the shape of the first output tensor is (8,56,56,4).
- The data structure centered on the splittable axis includes the type of the splittable axis and the positions at which the splittable axis appears in each input tensor and each output tensor. For example, if one input tensor has shape (3,1,5) and the other input tensor has shape (3,4,1), the output tensor obtained is (3,4,5), and the specific data structure centered on the splittable axis can be expressed as:
- The data structure centered on the input tensors and output tensors includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the axis type corresponding to each number: dim_slice_types: map<int, AXIS_TYPE> indicates the type of axis corresponding to each number.
- For example, if one input tensor has shape (3,1,5) and the other input tensor has shape (3,4,1), the output tensor obtained is (3,4,5), and the specific data structure centered on the tensor axis can be expressed as:
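A hypothetical Python rendering of the two data structures may make the contrast concrete. The enum values, field names, and tensor names (`x`, `y`, `out`) are our illustrative assumptions, not the patent's exact definitions; the shapes are those of the broadcast-addition example above.

```python
from enum import Enum

class AxisType(Enum):
    ELEMENT = 0
    REDUCTION = 1
    SLIDING_WINDOW = 2

# Splittable-axis-centred view for input shapes (3,1,5) and (3,4,1) and
# output shape (3,4,5): each axis records its type and the tensor
# dimensions it occupies (length-1 broadcast dimensions are omitted).
axis_centred = {
    "axis0": {"type": AxisType.ELEMENT,                  # length 3
              "inputs": {"x": 0, "y": 0}, "outputs": {"out": 0}},
    "axis1": {"type": AxisType.ELEMENT,                  # length 4
              "inputs": {"y": 1}, "outputs": {"out": 1}},
    "axis2": {"type": AxisType.ELEMENT,                  # length 5
              "inputs": {"x": 2}, "outputs": {"out": 2}},
}

# Tensor-centred view: per tensor, the dim_slice_types map<int, AXIS_TYPE>
# giving the axis type for each numbered dimension.
tensor_centred = {
    "x":   {0: AxisType.ELEMENT, 1: AxisType.ELEMENT, 2: AxisType.ELEMENT},
    "y":   {0: AxisType.ELEMENT, 1: AxisType.ELEMENT, 2: AxisType.ELEMENT},
    "out": {0: AxisType.ELEMENT, 1: AxisType.ELEMENT, 2: AxisType.ELEMENT},
}
```

Both views carry the same information; the axis-centred form answers "where does this splittable axis appear?", while the tensor-centred form answers "what kind of axis is each dimension of this tensor?".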
- The graph optimizer splits different operators and cascades them into subgraphs according to the axis types in the operators' input tensors and the position information of each axis in the input and output tensors; different axis position information may lead to different applications. The specific applications of the operator segmentation method for processing computing tasks in the embodiment of the present application are described in detail below with reference to FIG. 15 to FIG. 17.
- Fig. 15 is a schematic diagram of a specific application of operator segmentation provided by the embodiment of this application.
- Scenario 1: the output tensor of the first operator that includes the splittable axis serves as the input tensor of the second operator, optimizing the splitting of the input tensors of multiple consecutive operators.
- The axis type of the target splitting axis in the different operators, together with its position information in the first input tensors and first output tensors of the different operators, determines the splitting method of the first input tensors.
- For example, the graph optimizer obtains the segmentation information of the ReLU operator and the segmentation information of the TanH operator, and the axis type of the corresponding target segmentation axis in the target segmentation method determined by the graph optimizer is an element axis.
- The element axis appears on the 0-axis of the first input tensor and the 0-axis of the first output tensor of the ReLU operator. The shape of the first input tensor is (8,56,56,64), and according to the element-axis segmentation method described above, the shape of the first output tensor is also (8,56,56,64).
- Likewise, the element axis appears on the 0-axis of the first input tensor and the 0-axis of the first output tensor of the TanH operator, and the shape of the first input tensor is (8,56,56,64); therefore, according to the segmentation method corresponding to the element axis, the shape of the first output tensor is also (8,56,56,64).
- Because the graph optimizer knows the position information of the target splitting axis in the first input tensors and first output tensors of the different operators, when the ReLU operator and the TanH operator operate consecutively, the intermediate splicing operators and intermediate segmentation operators that would otherwise be generated by splitting the tensors of the ReLU and TanH operators can be omitted.
- The element axes appear on the 0-axis of the input and output tensors of both the ReLU operator and the TanH operator, so only one segmentation operator node and one splicing operator node are needed to realize the consecutive operation of the ReLU operator and the TanH operator.
- The first input tensor is segmented along the element axis to obtain the two equally divided second input tensors of the ReLU operator. On each computing resource, the second output tensor of the ReLU operator is obtained through the ReLU operator operation and used as the second input tensor of the TanH operator; the TanH operator operation then yields the second output tensor of the TanH operator, and finally a single splicing operator operation produces the final first output tensor.
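The ReLU→TanH cascade above can be sketched in NumPy; the function name `cascaded_relu_tanh` is ours, and the per-shard loop stands in for the parallel execution on the target computing resources.

```python
import numpy as np

def cascaded_relu_tanh(x, parts=2):
    # One segmentation node and one splicing node for the whole chain:
    # split along the shared element axis (axis 0), run ReLU then TanH on
    # each shard, and splice once at the end. The intermediate splice/split
    # between the two operators is omitted.
    shards = np.array_split(x, parts, axis=0)
    outs = [np.tanh(np.maximum(s, 0.0)) for s in shards]
    return np.concatenate(outs, axis=0)
```

The result is identical to running the unsplit chain, which is what allows the graph optimizer to drop the intermediate splicing and segmentation operators.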
- It should be noted that the axis type of the target segmentation axis in consecutive operators is not limited; here the axis types of the target segmentation axes in the first operator and the second operator are the same, both being element axes, merely for illustration. In practice, the axis types of the target splitting axes in consecutive operators may be the same or different, which is not limited in this embodiment of the present application.
- continuous operator operations are performed on the divided input tensor on the same target computing resource, so that parallel computing of multiple target computing resources can be realized.
- Fig. 16 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of this application.
- Scenario 2: the splittable axis appears on multiple input tensors and a single output tensor of a single operator, and the segmentation method of the input tensors of the first operator is determined accordingly.
- For example, the addition operator has two first input tensors: the 1st first input tensor x of shape (m, n) and the 2nd first input tensor y of shape (m,). The type of splittable axis 1 is an element axis, and splittable axis 1 appears on the 0-axis of the first input tensor x and the 0-axis of the first input tensor y, with length m; the type of splittable axis 2 is also an element axis, and splittable axis 2 appears on the 1-axis of the first input tensor x, with length n.
- There are two possible segmentation methods: the first is to segment the input tensors along splittable axis 1 of length m, and the second is to segment the input tensor along splittable axis 2 of length n.
- splitting is performed on an input tensor including a splittable axis 1 of length m.
- The first input tensor x can be equally divided along the 0-axis into two second input tensors x0 and x1, and the first input tensor y can be equally divided along the 0-axis into two second input tensors y0 and y1; these are sent to two target computing resources for the addition operator operation to obtain the second output tensors, and the first output tensor is then obtained by calling the splicing function.
- splitting is performed on an input tensor including a splittable axis 2 of length n.
- the first input tensor x can be equally divided into two second input tensors x0' and x1' according to the 1-axis.
- the first input tensor y is sent as shared data to different target computing resources. Subsequently, the addition operator operation is performed on each target computing resource to obtain the second output tensor, and then the first output tensor is obtained by calling the splicing function.
- Each target computing resource can obtain the first input tensor y by addressing, or the first input tensor y can be copied to each target computing resource; the embodiment of the present application does not limit the manner in which the first input tensor y is shared.
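Both strategies for the addition example can be sketched as follows; the helper names are hypothetical, and the row-wise broadcast `y[:, None]` encodes the assumption that splittable axis 1 (length m) is the 0-axis of both x and y.

```python
import numpy as np

def add_op(x, y):
    # the addition operator of the example: y of shape (m,) is added row-wise
    return x + y[:, None]

def split_axis1(x, y, parts=2):
    # strategy 1: split both first input tensors along splittable axis 1 (m)
    xs = np.array_split(x, parts, axis=0)
    ys = np.array_split(y, parts, axis=0)
    return np.concatenate([add_op(a, b) for a, b in zip(xs, ys)], axis=0)

def split_axis2(x, y, parts=2):
    # strategy 2: split x along splittable axis 2 (n); y is shared data
    xs = np.array_split(x, parts, axis=1)
    return np.concatenate([add_op(a, y) for a in xs], axis=1)
```

Either strategy reproduces the unsplit result; which one is preferable depends on whether copying or addressing the shared tensor y is cheaper than splitting it.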
- In this way, the graph optimizer can determine the appropriate operator segmentation method according to the axis type of the splittable axes included in the segmentation information of the operator and the position information of the splittable axes in the input tensors and output tensors of the operator.
- Fig. 17 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of the present application. Scenario 3, the positions of splittable axis 1 in the first input tensor and the first output tensor of the first operator are different.
- For example, the graph optimizer obtains the segmentation information of the transformation operator and determines from it that splittable axis 1 is an element axis. As shown in (a) of Figure 17, splittable axis 1 is the 0-axis of the first input tensor, while in the first output tensor splittable axis 1 is the 1-axis. Based on the forward shape derivation function of the element axis, the shape (56,8,56,64) of the first output tensor can be derived from the shape (8,56,56,64) of the first input tensor.
- The first output tensor is divided along the 1-axis to obtain two second output tensors whose 1-axis length is 4. Then, according to the inverse shape derivation function of the element axis, the 0-axis length of each of the two second input tensors is determined to be 4, and the segmentation function is called to split the first input tensor, whose 0-axis length is 8, along the 0-axis to obtain the two second input tensors.
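Scenario 3 can be sketched as follows; the permutation `(1, 0, 2, 3)` is an assumed stand-in for the transformation operator, chosen so that input axis 0 becomes output axis 1 as in the figure.

```python
import numpy as np

def transpose_op(x):
    # assumed transformation operator: input axis 0 -> output axis 1
    return np.transpose(x, (1, 0, 2, 3))

def split_transpose(x, parts=2):
    # split the input along axis 0 (where splittable axis 1 lies), apply the
    # operator per shard, then splice the outputs along axis 1 (where
    # splittable axis 1 lands in the output tensor)
    shards = np.array_split(x, parts, axis=0)
    return np.concatenate([transpose_op(s) for s in shards], axis=1)
```

The splice axis differs from the split axis precisely because the position of the splittable axis differs between the input tensor and the output tensor.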
- In this way, the graph optimizer only needs to know the axis type of the splittable axes of the operator's input tensors and the position information of the splittable axes in the input and output tensors; it does not need to rely on the principles of specific operators to properly segment the operator's input and output tensors, which realizes the complete decoupling of operator optimization and graph optimization.
- the foregoing content is a description of the method for processing computing tasks in the embodiment of the present application.
- the device for processing computing tasks in the embodiment of the present application will be described below in conjunction with FIG. 19 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
- Fig. 19 is a schematic diagram of an apparatus for processing computing tasks provided by an embodiment of the present application.
- the device 1900 is applied to a graph optimizer, and the device includes: a processor 1901 and a transmission interface 1902 .
- the device may further include a memory 1903 and a bus 1904 .
- the memory 1903 , the processor 1901 , and the transmission interface 1902 realize communication connection with each other through the bus 1904 .
- The memory 1903 may be a ROM, a static storage device, or a RAM.
- The memory 1903 may store a program. When the program stored in the memory 1903 is executed by the processor 1901, the processor 1901 and the transmission interface 1902 are used to execute the various steps of the method for processing a computing task in the embodiments of the present application.
- the processor 1901 is configured to determine a first operator for performing a computing task, where the first operator includes N divisible axes, and N is a positive integer greater than or equal to 1;
- the processor 1901 is configured to segment the input tensor of the first operator according to the segmentation information of the first operator, and determine K groups of input tensors, where K is a positive integer greater than or equal to 2.
- the transmission interface 1902 is used to send K sets of input tensors to K target computing resources respectively, so that the K target computing resources can complete computing tasks.
- The processor 1901 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits for executing related programs, so as to realize the functions required by the units in the device for processing computing tasks in the embodiment of the present application, or to execute the method for processing computing tasks in the method embodiments of the present application.
- the processor 1901 may also be an integrated circuit chip, which has a signal processing capability. During implementation, each step of the method for processing a computing task in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 1901 or instructions in the form of software.
- the aforementioned processor 1901 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
- The software module can be located in a mature storage medium in the field, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register.
- The storage medium is located in the memory 1903. The processor 1901 reads the information in the memory 1903 and, in combination with its hardware, completes the functions required by the units included in the device for processing computing tasks of the embodiment of the present application, or executes the method for processing computing tasks of the method embodiments of the present application.
- the transmission interface 1902 implements communication between the apparatus 1900 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
- the image to be processed can be obtained through the transmission interface 1902 .
- the bus 1904 may include a path for transferring information between various components of the device 1900 (eg, the memory 1903, the processor 1901, the transmission interface 1902).
- the apparatus 1900 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 1900 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the device 1900 may also only include the devices necessary to realize the embodiment of the present application, and does not necessarily include all the devices shown in FIG. 19 .
- the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
- a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
- the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
- The non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or flash memory.
- Volatile memory can be random access memory (RAM), which acts as external cache memory.
- By way of example but not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
- the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
- the above-described embodiments may be implemented in whole or in part in the form of computer program products.
- the computer program product comprises one or more computer instructions or computer programs.
- When the computer instructions or computer programs are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
- the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
- The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means (such as infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
- the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
- the semiconductor medium may be a solid state drive.
- An embodiment of the present application provides a computer-readable storage medium, which is used to store a computer program, and when the computer program is run on a computer, the computer executes the method for processing computing tasks as in the foregoing method embodiments.
- An embodiment of the present application provides a computer program product, and the computer program product includes: computer program code, when the computer program code is executed, implements the method for processing computing tasks as in the foregoing method embodiments.
- "At least one" means one or more, and "multiple" means two or more.
- "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items.
- at least one item (unit) in a, b or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c can be single or multiple.
- The sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
- the disclosed systems, devices and methods may be implemented in other ways.
- the device embodiments described above are only illustrative.
- The division of the units is only a logical function division; in actual implementation there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- Each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist physically separately, or two or more units may be integrated into one unit.
- If the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
- The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- User Interface Of Digital Computer (AREA)
- Multi Processors (AREA)
Abstract
Embodiments of the present application provide a method and device for processing a computing task. The method comprises: determining a first operator used for executing a computing task, the first operator comprising N splittable axes, N being a positive integer greater than or equal to 1; obtaining segmentation information of the first operator from an operator segmentation information base; segmenting an input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors; and sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task. In this way, a graph optimizer can automatically segment the input and output tensors of an operator without relying on the principle of any specific operator, thereby achieving complete decoupling of the graph optimizer and the operator optimization module, so that the operator corresponding to the computing task is computed in parallel on the multiple computing resources.
Description
The embodiments of the present application relate to the field of artificial intelligence, and more specifically, to a method and device for processing computing tasks.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence gives machines the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and so on.
Open-source deep learning frameworks such as TensorFlow, PyTorch, and MXNet provide users with a friendly programming environment for deep learning models, allowing users to easily deploy their models on general-purpose computer hardware platforms such as the central processing unit (CPU) and the graphics processing unit (GPU). If a designed deep learning model is to be deployed on a specific device, the forward inference framework of that device's vendor is generally used; for example, TensorRT is used on Nvidia GPUs. If a deep learning model needs to run on multiple different types of devices, a deep learning compiler can be used to generate efficient code for each type of device from the model described by the deep learning framework.
Deep learning compilers usually improve the running performance of models on different hardware through graph optimization and operator optimization. These two optimizations are usually relatively decoupled and independent of each other; however, implementing graph optimization often requires knowledge of the principle of each operator in order to obtain a suitable parallel strategy for operator optimization. Therefore, how the graph optimizer can perform automatic operator segmentation without relying on the principle of specific operators is an urgent problem to be solved.
Summary of the invention
The embodiments of the present application provide a method and device for processing computing tasks, so that a graph optimizer can automatically split the input and output tensors of an operator without relying on the principle of any specific operator, thereby achieving complete decoupling of the graph optimizer and the operator optimization module and enabling the operator corresponding to a computing task to be computed in parallel on multiple computing resources.
In a first aspect, a method for processing a computing task is provided. The method is executed by a graph optimizer and includes: determining a first operator for executing the computing task, the first operator including N splittable axes, where N is a positive integer greater than or equal to 1; obtaining segmentation information of the first operator from an operator segmentation information base, the segmentation information of the first operator including the axis type, in the first operator, of the n-th of the N splittable axes and first position information, where the first position information indicates the position of the n-th splittable axis in the input tensor of the first operator, and n=1,…,N; splitting the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2; and sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
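The first-aspect flow above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes a toy operator segmentation information base (`SEG_INFO_BASE`), NumPy tensors, and only an element-axis split; all names are hypothetical.

```python
import numpy as np

# Hypothetical operator segmentation information base: for each operator,
# the axis type of each splittable axis and its position (axis index) in
# the operator's input tensor.
SEG_INFO_BASE = {
    "relu": [{"axis_type": "element", "input_axis": 0}],
}

def split_for_parallelism(op_name, input_tensor, k):
    """Split the operator's input tensor into K pieces along a splittable
    element axis, so K target computing resources can work in parallel."""
    info = SEG_INFO_BASE[op_name][0]       # pick a target splittable axis
    assert info["axis_type"] == "element"  # this sketch handles element axes only
    return np.array_split(input_tensor, k, axis=info["input_axis"])

x = np.arange(12.0).reshape(6, 2)
parts = split_for_parallelism("relu", x, 3)
# Each part goes to one target computing resource; results are concatenated.
merged = np.concatenate(parts, axis=0)
assert np.array_equal(merged, x)
```

The graph optimizer only reads the segmentation information base; it never inspects the operator's mathematical definition.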
It should be understood that the N splittable axes included in the first operator mean that the input tensor of the first operator includes N splittable axes.
It should also be understood that the computing task may be a computing task in the field of artificial intelligence, for example, an image processing task, a video processing task, a speech processing task, or a natural language processing task; the computing task may also be a computing task in the field of big data processing, or a computing task in the field of high-performance computing (HPC), which is not limited in this application. Correspondingly, the input tensor of the first operator corresponding to the computing task may be the input tensor corresponding to a computing task in any of the above fields; for example, when the computing task is an image processing task, the input tensor of the first operator represents image-related data.
At present, the segmentation of an operator's input tensor is determined by an algorithm engineer at the application layer, using a scripting language and the splittable axes of a particular operator type, so automatic segmentation of the operator's input tensor cannot be achieved. In the embodiments of this application, the graph optimizer obtains the segmentation information of an operator from the operator segmentation information base. Since the segmentation information of each operator can be obtained directly from that base, the graph optimizer does not need to perceive the mathematical semantics or underlying implementation of any operator at all, and can automatically segment the operator's input tensor, thereby achieving complete decoupling of graph optimization and operator optimization so that the operator corresponding to the computing task is computed in parallel on multiple computing resources.
In a possible implementation, the axis type of a splittable axis is one of the following: element axis, reduction axis, and sliding window axis. An axis on which the elements of the operator's input tensor and output tensor have a point-to-point mapping is an element axis; if a first axis exists in the operator's input tensor but not in the operator's output tensor, the first axis is a reduction axis; an axis on which the operator performs a sliding-window scan over the elements of its input tensor is a sliding window axis.
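The three axis types can be illustrated with small NumPy examples (an illustrative sketch, not part of the claimed method): an elementwise operation for the element axis, a sum for the reduction axis, and a 1-D convolution for the sliding window axis.

```python
import numpy as np

x = np.arange(8.0)

# Element axis: input and output elements map one-to-one (e.g. ReLU),
# so the axis can be cut anywhere and results concatenated.
relu = np.maximum(x, 0.0)
assert relu.shape == x.shape

# Reduction axis: the axis exists in the input but not in the output
# (e.g. sum over axis 0); slices give partial results that must be
# combined by the reduction operation itself.
total = x.sum()
partial = x[:4].sum() + x[4:].sum()
assert total == partial

# Sliding window axis: the operator scans the axis with a window
# (e.g. 1-D convolution), so splits must overlap by window_size - 1.
w = np.array([1.0, -1.0, 2.0])
full = np.convolve(x, w, mode="valid")
left = np.convolve(x[:5], w, mode="valid")   # windows starting at 0..2
right = np.convolve(x[3:], w, mode="valid")  # overlap of len(w)-1 = 2 elements
assert np.allclose(np.concatenate([left, right]), full)
```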
In a possible implementation, a target splitting axis is determined, the target splitting axis being one of the N splittable axes; according to the segmentation information of the first operator, the splitting method corresponding to the axis type of the target splitting axis in the first operator is determined; and the input tensor of the first operator is split according to that splitting method to obtain K groups of input tensors.
In the embodiments of this application, the graph optimizer performs single-operator segmentation according to the types of the different axes in the operator's input tensor and the splitting methods corresponding to those axis types. In this way, the graph optimizer can automatically obtain different single-operator splitting strategies without relying on the principle of any specific operator, thereby achieving complete decoupling of the graph optimizer and the operator optimization module.
In a possible implementation, splitting the input tensor of the first operator according to the splitting method corresponding to the axis type of the target splitting axis in the first operator to obtain K groups of input tensors includes: according to the splitting method, determining the Q first input tensors of the first operator that include the target splitting axis and the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number K of target computing resources to obtain Q groups of second input tensors; and obtaining K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
Each group of the Q groups of second input tensors includes K second input tensors, and the q-th group of the Q groups of second input tensors is the result of splitting the q-th of the Q first input tensors into K pieces, where q=1,…,Q.
The k-th group of the K groups of input tensors includes the k-th second input tensor of each group of the Q groups of second input tensors and the unsplit input tensors of the first operator.
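The grouping described above (K groups, where group k holds the k-th piece of every split input plus the unsplit inputs) can be sketched as follows; `build_k_groups` is a hypothetical helper using NumPy, handling only an element axis.

```python
import numpy as np

def build_k_groups(split_inputs, unsplit_inputs, axis, k):
    """Split each of the Q inputs that carry the target axis into K pieces,
    then assemble K groups: group k holds the k-th piece of every split
    input plus all unsplit inputs (hypothetical helper)."""
    q_groups = [np.array_split(t, k, axis=axis) for t in split_inputs]  # Q groups of K
    return [
        {"split": [g[i] for g in q_groups], "unsplit": unsplit_inputs}
        for i in range(k)
    ]

a = np.ones((4, 3))  # carries the target splittable axis (axis 0)
b = np.ones((4, 3))
bias = np.zeros(3)   # does not carry the axis: sent whole to every resource
groups = build_k_groups([a, b], [bias], axis=0, k=2)
assert len(groups) == 2
assert groups[0]["split"][0].shape == (2, 3)
```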
In a possible implementation, in the case that the operators used to execute the computing task further include a second operator, the second operator includes P splittable axes, and the P splittable axes are a subset of the N splittable axes. Splitting the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors includes: obtaining the segmentation information of the second operator from the operator segmentation information base, the segmentation information of the second operator including the axis type, in the second operator, of the p-th of the P splittable axes and second position information, where the second position information indicates the position of the p-th splittable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p=1,…,P; determining P pieces of splitting reference information according to the segmentation information of the first operator and the segmentation information of the second operator, the p-th piece of splitting reference information including the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator; determining P groups of candidate splitting methods according to the P pieces of splitting reference information, where the p-th group of candidate splitting methods includes at least one splitting method; determining a target splitting method according to the time each splitting method in the P groups of candidate splitting methods needs to complete the computing task; and splitting the input tensor of the first operator according to the target splitting method to obtain K groups of input tensors.
The splitting methods included in the p-th group of candidate splitting methods are determined according to the p-th of the P pieces of splitting reference information and the number M of computing resources.
In the embodiments of this application, the graph optimizer automatically splits the operator's input and output tensors according to the different types of axes. The graph optimizer does not need to split the input and output tensors based on the principle of a specific operator; it only needs to split them according to the splitting methods corresponding to the different axis types. For the operator, splitting its input and output tensors does not change the operator's calculation formula; it only changes some of the operator's parameters. This achieves complete decoupling of graph optimization from the principles of specific operators, and splitting the operator's first input tensors based on axis types generalizes better. In addition, according to the axis types of the splittable axes and the position information of the splittable axes on the operator's input and output tensors included in the operator's segmentation information, a suitable operator splitting method can be selected flexibly.
As a possible implementation, splitting the input tensor of the first operator according to the target splitting method to obtain K groups of input tensors includes: according to the target splitting method, determining the target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, the Q first input tensors of the first operator that include the target splitting axis, and the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number K of target computing resources to obtain Q groups of second input tensors, where each group of the Q groups of second input tensors includes K second input tensors, and the q-th group is the result of splitting the q-th of the Q first input tensors into K pieces, q=1,…,Q; and obtaining K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
The k-th group of the K groups of input tensors includes the k-th second input tensor of each group of the Q groups of second input tensors and the unsplit input tensors of the first operator.
As a possible implementation, splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number K of target computing resources to obtain Q groups of second input tensors includes: if the axis type of the target splitting axis in the first operator is an element axis or a sliding window axis, and the axis type of the target splitting axis in the second operator is an element axis or a sliding window axis, then determining, according to the first position information and the second position information of the target splitting axis, the L first output tensors of the first operator that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; taking the first input length as the input of the forward shape derivation function corresponding to the axis type of the target splitting axis in the first operator to obtain a third input length, the first input length being the length of the target splitting axis in each first input tensor, where the lengths of the target splitting axis in the first input tensors are equal; taking the third input length as the input of the forward shape derivation function corresponding to the axis type of the target splitting axis in the second operator to obtain a first output length; splitting the L first output tensors along the target splitting axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each group of the L groups of second output tensors includes K second output tensors, and the l-th group is the result of splitting the l-th of the L first output tensors into K pieces; taking the K second output lengths corresponding to the target splitting axis in each group of the L groups of second output tensors as the inputs of the reverse derivation function corresponding to the axis type of the target splitting axis in the second operator to obtain the K third input lengths corresponding to the target splitting axis in each group of Q groups of fifth input tensors, where the lengths corresponding to the target splitting axis in the k-th second output tensor of each group of the L groups of second output tensors are equal, and the lengths corresponding to the target splitting axis in the k-th tensor of each group of the Q groups of fifth input tensors are equal; taking the K third input lengths corresponding to the target splitting axis in each group of the Q groups of fifth input tensors as the inputs of the reverse derivation function corresponding to the axis type of the target splitting axis in the first operator to obtain the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, where the lengths corresponding to the target splitting axis in the k-th second input tensor of each group of the Q groups of second input tensors are equal; and splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors to obtain the Q groups of second input tensors.
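The forward and reverse shape-derivation chain described above can be illustrated for a sliding window axis as follows; the window and stride parameters and the function names are illustrative assumptions, not taken from the embodiments.

```python
# Shape-derivation sketch for chaining two sliding-window operators on the
# target splitting axis (stride 1, window 3 assumed for both operators).

def fwd_sliding(length, window, stride=1):
    """Forward shape derivation: output length of a sliding-window op."""
    return (length - window) // stride + 1

def bwd_sliding(out_length, window, stride=1):
    """Reverse derivation: input length needed to produce out_length."""
    return (out_length - 1) * stride + window

first_in = 16                             # first input length on the target axis
mid = fwd_sliding(first_in, window=3)     # length after the first operator
out = fwd_sliding(mid, window=3)          # first output length (after the second)

k = 2                                     # number of target computing resources
out_pieces = [out // k + (1 if i < out % k else 0) for i in range(k)]
# Back-derive through the second operator, then through the first:
mid_pieces = [bwd_sliding(p, window=3) for p in out_pieces]   # third input lengths
in_pieces = [bwd_sliding(p, window=3) for p in mid_pieces]    # second input lengths
assert sum(out_pieces) == out
# The per-piece input lengths overlap, so their sum exceeds first_in.
assert sum(in_pieces) > first_in
```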
In the embodiments of this application, consecutive operator computations on the split input tensors are performed on the same target computing resource, so that parallel computation across multiple target computing resources can be achieved.
As a possible implementation, when the axis type of the target splitting axis in the first operator is an element axis or a sliding window axis, the first position information of the target splitting axis is also used to indicate the position of the target splitting axis in the output tensor of the first operator, and splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number K of target computing resources to obtain Q groups of second input tensors includes: determining, according to the first position information of the target splitting axis, the L first output tensors of the first operator that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; taking the first input length as the input of the forward shape derivation function of the target splitting axis to obtain a first output length, the first input length being the length of the target splitting axis in each first input tensor, where the lengths of the target splitting axis in the first input tensors are equal; splitting the L first output tensors along the target splitting axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each group of the L groups of second output tensors includes K second output tensors; taking the K second output lengths corresponding to the target splitting axis in each group of the L groups of second output tensors as the inputs of the reverse derivation function of the target splitting axis to obtain the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors; and splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors to obtain the Q groups of second input tensors.
The l-th group of the L groups of second output tensors is the result of splitting the l-th of the L first output tensors into K pieces.
The lengths corresponding to the target splitting axis in the k-th second output tensor of each group of the L groups of second output tensors are equal, and the lengths corresponding to the target splitting axis in the k-th second input tensor of each group of the Q groups of second input tensors are equal.
As a possible implementation, when the axis type of the target splitting axis in the first operator is an element axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a first splitting function to split each of the Q first input tensors along the target splitting axis according to those K second input lengths, to obtain the Q groups of second input tensors.
In the q-th group of the Q groups of second input tensors, the elements corresponding to the target splitting axis in each second input tensor are a subset of the elements corresponding to the target splitting axis in the q-th of the Q first input tensors; the elements corresponding to the target splitting axis in the second input tensors of the q-th group have no intersection with each other; and the union of the elements corresponding to the target splitting axis in the second input tensors of the q-th group equals the elements corresponding to the target splitting axis in the q-th first input tensor.
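The disjoint-partition property of an element-axis split (no intersection between pieces, union equal to the original elements) can be checked with a small NumPy sketch; `np.array_split` stands in for the first splitting function.

```python
import numpy as np

# Element-axis split: the K pieces are disjoint along the target axis and
# their union restores the original tensor.
x = np.arange(10)
pieces = np.array_split(x, 3)                 # stands in for the first splitting function
assert sum(p.size for p in pieces) == x.size  # no overlap, nothing dropped
assert np.array_equal(np.concatenate(pieces), x)
```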
As a possible implementation, when the axis type of the target splitting axis in the first operator is a sliding window axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a first slice function to split each of the Q first input tensors along the target splitting axis with overlap, according to those K second input lengths, to obtain the Q groups of second input tensors.
In the q-th group of the Q groups of second input tensors, the elements corresponding to the target splitting axis in each second input tensor are a subset of the elements corresponding to the target splitting axis in the q-th of the Q first input tensors; the elements corresponding to the target splitting axis in the second input tensors of the q-th group intersect with each other; and the union of the elements corresponding to the target splitting axis in the second input tensors of the q-th group equals the elements corresponding to the target splitting axis in the q-th first input tensor.
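The overlapping split for a sliding window axis can be sketched as follows; `overlapped_slices` is a hypothetical helper (stride 1 assumed) that extends each slice by window minus one elements, so that the per-piece convolutions concatenate to the full result.

```python
import numpy as np

def overlapped_slices(x, k, window, axis=0):
    """Split x along `axis` into K overlapping slices (overlap = window - 1),
    so each piece can run the sliding-window operator independently
    (hypothetical slice helper, stride 1 assumed)."""
    n_out = x.shape[axis] - window + 1             # total number of output windows
    bounds = [round(i * n_out / k) for i in range(k + 1)]
    slices = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        idx = [slice(None)] * x.ndim
        idx[axis] = slice(lo, hi - 1 + window)     # extend the slice by window - 1
        slices.append(x[tuple(idx)])
    return slices

x = np.arange(10.0)
parts = overlapped_slices(x, k=2, window=3)
w = np.array([1.0, 2.0, 1.0])
ref = np.convolve(x, w, mode="valid")
got = np.concatenate([np.convolve(p, w, mode="valid") for p in parts])
assert np.allclose(got, ref)   # concatenated per-piece results match the full op
```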
As a possible implementation, when the axis type of the target split axis in the first operator is the sliding-window axis, splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each group of the Q groups of second input tensors, to obtain the Q groups of second input tensors, includes: invoking the second split function to split each of the Q first input tensors along the target split axis, obtaining Q groups of third input tensors, each group of third input tensors including K third input tensors; according to the K second input lengths corresponding to the target split axis in each group of second input tensors, invoking the second slice function to split the K third input tensors in each of the Q groups of third input tensors along the target split axis, obtaining Q groups of fourth input tensors; and invoking the concatenation function to concatenate, along the target split axis, the k-th fourth input tensor in the q-th group of fourth input tensors with the k-th third input tensor in the q-th group of third input tensors, obtaining the Q groups of second input tensors.
Here, in the q-th group of the Q groups of third input tensors, the elements corresponding to the target split axis in each third input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the third input tensors of the q-th group are pairwise disjoint; and the union of those elements equals the elements corresponding to the target split axis in the q-th first input tensor.
Here, the elements corresponding to the target split axis in the k-th second input tensor of the q-th group of second input tensors are contiguous.
In this embodiment of the application, the non-overlapping split of the sliding-window axis is suited to scenarios that require frequent data synchronization between different computing resources, for example multi-die parallelism, where the concatenation function serves as the data-synchronization node between dies. In this way, overlapping data is neither recomputed nor allowed to keep growing, which effectively relieves the computation and storage pressure on the computing resources.
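For the non-overlapping variant (second split function, second slice function, concatenation function), a sketch under the same toy convolution assumptions (all names and sizes are illustrative): the input is first cut into disjoint chunks, each resource then receives its own chunk concatenated with a slice of its neighbour's chunk, and that concatenation is the only data-exchange point between dies.

```python
import numpy as np

def conv1d_valid(x, w):
    # "valid" 1-D convolution: output length = len(x) - len(w) + 1
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

x = np.arange(10.0)
w = np.array([1.0, 2.0, 1.0])
full = conv1d_valid(x, w)

# Step 1: disjoint split (the "third input tensors") for K = 2 dies.
chunks = [x[0:5], x[5:10]]
# Step 2: slice the halo each die needs from its neighbour's chunk
# (the "fourth input tensors"); only this slice crosses the die boundary.
halos = [chunks[1][:1], chunks[0][-1:]]
# Step 3: concatenate chunk and halo into contiguous "second input tensors".
pieces = [np.concatenate([chunks[0], halos[0]]),   # x[0:6]
          np.concatenate([halos[1], chunks[1]])]   # x[4:10]
out = np.concatenate([conv1d_valid(p, w) for p in pieces])
assert np.allclose(out, full)
```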
As a possible implementation, when the axis type of the target split axis in the first operator is the reduction axis, splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources, to obtain the Q groups of second input tensors, includes: according to the number K of target computing resources, calling the third split function to split each of the Q first input tensors, obtaining the Q groups of second input tensors.
Here, in the q-th group of the Q groups of second input tensors, the elements corresponding to the target split axis in each second input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the second input tensors of the q-th group are pairwise disjoint; and the union of those elements equals the elements corresponding to the target split axis in the q-th first input tensor.
In this embodiment of the application, because the type of the reduction axis already determines the specific split method, the graph optimizer can, during graph optimization, reasonably split the input tensors of an operator that contains a reduction axis without relying on the principle of the specific operator. By contrast, current operator-splitting approaches all split starting from the output tensor of the specific operator, and because a reduction axis either does not appear in the output tensor or has length 1 there, traditional operator splitting cannot split an input-tensor axis that has the characteristics of a reduction axis.
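As a hedged sketch of why a reduction axis admits a fixed split recipe (NumPy, reduce-sum; shapes are illustrative): the axis is absent from the output, so the input is cut into disjoint pieces along it and the per-resource partial results are merged with the same reduce operation.

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)    # axis 1 is the reduction axis
full = x.sum(axis=1)                 # the axis vanishes from the output

# Third-split-function style: disjoint pieces along the reduction axis
# for K = 2 target computing resources, one partial sum per resource.
parts = np.split(x, 2, axis=1)
partials = [p.sum(axis=1) for p in parts]
combined = partials[0] + partials[1]
assert np.allclose(combined, full)
```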
As a possible implementation, the reduction axes include a first type of reduction axis and a second type of reduction axis, where a reduction axis of the first type is one along which the operator performs a reduce operation on the elements of its input tensor, and a reduction axis of the second type is one along which the operator does not perform a reduce operation on the elements of its input tensor.
As a possible implementation, the first type of reduction axis includes any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, and a reduce-mean axis, where the reduce-sum axis is the reduction axis along which the operator reduces the elements of its input tensor by summation; the reduce-max axis is the reduction axis along which the operator reduces the elements by taking the maximum; the reduce-min axis is the reduction axis along which the operator reduces the elements by taking the minimum; and the reduce-mean axis is the reduction axis along which the operator reduces the elements by averaging.
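How the partial results recombine depends on which first-type reduction axis is involved; a sketch (NumPy, illustrative shapes): partial maxima and minima merge with the same max/min, while partial means must be reweighted by the length of each piece.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((3, 8))
parts = np.split(x, 2, axis=1)       # disjoint split along the reduction axis

# reduce-max / reduce-min: merge the partials with the same operation
assert np.allclose(np.maximum(parts[0].max(axis=1), parts[1].max(axis=1)),
                   x.max(axis=1))
assert np.allclose(np.minimum(parts[0].min(axis=1), parts[1].min(axis=1)),
                   x.min(axis=1))

# reduce-mean: weight each partial mean by the length of its piece
lens = [p.shape[1] for p in parts]
mean = sum(n * p.mean(axis=1) for n, p in zip(lens, parts)) / sum(lens)
assert np.allclose(mean, x.mean(axis=1))
```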
As a possible implementation, the second type of reduction axis includes a reduce-gather axis, which is the axis along which the operator indexes data on its input tensor according to the addresses indicated by the elements of its index input tensor.
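A gather axis can also be split without knowing the operator's internals; a hedged sketch (NumPy; sharding scheme and names are illustrative assumptions): each resource holds one shard of the table, resolves only the indices that land in its shard, and the partial results combine by addition.

```python
import numpy as np

params = np.arange(10.0) * 10        # input tensor; axis 0 is the gather axis
idx = np.array([1, 7, 3, 9])         # index input tensor
full = params[idx]                   # gather: address-driven indexing

# Split the gather axis into K = 2 shards; each resource resolves only
# the indices that fall inside its shard and contributes 0 elsewhere.
shard = 5
partials = []
for k in range(2):
    lo = k * shard
    local_idx = np.clip(idx - lo, 0, shard - 1)        # safe local lookup
    hit = (idx >= lo) & (idx < lo + shard)             # indices owned here
    partials.append(np.where(hit, params[lo:lo + shard][local_idx], 0.0))
combined = partials[0] + partials[1]
assert np.allclose(combined, full)
```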
As a possible implementation, a computing resource is one of the following: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
In a second aspect, a device for processing a computing task is provided. The device is applied to a graph optimizer and includes a processor and a transmission interface. The processor is configured to determine a first operator for executing the computing task, the first operator including N splittable axes, where N is a positive integer greater than or equal to 1. The processor is configured to obtain split information of the first operator from an operator split information library, the split information of the first operator including, for the n-th of the N splittable axes, the axis type of that axis in the first operator and first position information, where the first position information indicates the position of the n-th splittable axis in the input tensors of the first operator, and n = 1, ..., N. The processor is configured to split the input tensors of the first operator according to the split information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2. The transmission interface is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be understood that the N splittable axes included in the first operator means that the input tensors of the first operator include N splittable axes.
It should also be understood that the computing task may be a computing task in the artificial-intelligence field, for example an image-processing task, a video-processing task, a speech-processing task, or a natural-language-processing task; it may also be a computing task in the big-data field or in the high-performance computing (HPC) field, which this application does not limit. Correspondingly, the input tensor of the first operator may be the input tensor of a computing task in any of the above fields; for example, when the computing task is an image-processing task, the input tensor of the first operator represents image-related data.
At present, splitting the input tensors of an operator requires an algorithm engineer at the application layer to determine, in a scripting language, the split method according to the split axes included in a particular operator type, so the input tensors of an operator cannot be split automatically. In this embodiment of the application, the graph optimizer obtains the split information of each operator directly from the operator split information library; therefore the graph optimizer can automatically split the input tensors of an operator without perceiving the mathematical semantics or the underlying implementation of any operator, thereby fully decoupling graph optimization from operator optimization and allowing the operators of the computing task to run in parallel on multiple computing resources.
As a possible implementation, the axis type of a splittable axis is one of the following: the element axis, the reduction axis, and the sliding-window axis. An axis whose elements have a point-to-point mapping between the operator's input tensor and output tensor is an element axis; if a first axis exists in the operator's input tensor but not in its output tensor, the first axis is a reduction axis; and the axis along which the operator performs a sliding-window scan over the elements of its input tensor is a sliding-window axis.
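A sketch of what entries in such an operator split information library might look like (the dictionary layout and field names are purely illustrative assumptions, not the claimed data format): for each operator, every splittable axis records its axis type and where it sits in the operator's input tensors, which is all the graph optimizer needs to pick a split method.

```python
# Hypothetical split-information entries; "inputs" maps an input-tensor
# index to the dimension at which the axis appears in that tensor.
SPLIT_INFO = {
    "MatMul": {          # (M, K) x (K, N) -> (M, N)
        "M": {"type": "element",        "inputs": {0: 0}},
        "N": {"type": "element",        "inputs": {1: 1}},
        "K": {"type": "reduction",      "inputs": {0: 1, 1: 0}},
    },
    "Conv2D": {          # NHWC activation, HWIO weight, for illustration
        "N":   {"type": "element",        "inputs": {0: 0}},
        "H":   {"type": "sliding_window", "inputs": {0: 1}},
        "W":   {"type": "sliding_window", "inputs": {0: 2}},
        "Cin": {"type": "reduction",      "inputs": {0: 3, 1: 2}},
    },
}

# The optimizer chooses a split method from the axis type alone,
# without knowing the operator's mathematics.
assert SPLIT_INFO["MatMul"]["K"]["type"] == "reduction"
assert SPLIT_INFO["Conv2D"]["H"]["type"] == "sliding_window"
```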
As a possible implementation, the processor is specifically configured to: determine a target split axis, the target split axis being one of the N splittable axes; determine, according to the split information of the first operator, the split method corresponding to the axis type of the target split axis in the first operator; and split the input tensors of the first operator according to that split method to obtain the K groups of input tensors.
As a possible implementation, the processor is specifically configured to: determine, according to the split method corresponding to the axis type of the target split axis in the first operator, the Q first input tensors of the first operator that include the target split axis and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources, to obtain Q groups of second input tensors; and obtain the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
Here, each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of splitting the q-th of the Q first input tensors into K pieces, where q = 1, ..., Q.
Here, the k-th of the K groups of input tensors includes the k-th second input tensor from each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
In this embodiment of the application, by performing single-operator splitting according to the types of the axes in the operator's input tensors and the split methods corresponding to those axis types, the graph optimizer can automatically derive different single-operator split strategies without relying on the principle of the specific operator, thereby fully decoupling the graph optimizer from the operator-optimization module.
As a possible implementation, in the case where the operators used to execute the computing task further include a second operator, the second operator includes P splittable axes, the P splittable axes being a subset of the N splittable axes, and the processor is specifically configured to: obtain split information of the second operator from the operator split information library, the split information of the second operator including, for the p-th of the P splittable axes, the axis type of that axis in the second operator and second position information, where the second position information indicates the position of the p-th split axis in the input tensors of the second operator, the input tensors of the second operator are the output tensors of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, ..., P; determine P pieces of split reference information according to the split information of the first operator and the split information of the second operator, the p-th piece of split reference information including: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensors of the first operator; determine P groups of candidate split methods according to the P pieces of split reference information, where the p-th group of candidate split methods includes at least one split method; determine a target split method according to the time each split method in the P groups of candidate split methods needs to complete the computing task; and split the input tensors of the first operator according to the target split method to obtain the K groups of input tensors.
As a possible implementation, the split methods included in the p-th group of candidate split methods are determined according to the p-th of the P pieces of split reference information and the number M of computing resources.
In this embodiment of the application, the graph optimizer automatically determines how to split the operator's input and output tensors according to the different axis types. The graph optimizer does not need to split the input and output tensors based on the principle of a specific operator; it only needs to split them according to the split methods corresponding to the different axis types. For the operator itself, splitting its input and output tensors does not change its calculation formula before and after the split, but only changes some of its parameters, so graph optimization is thoroughly decoupled from the principles of specific operators. Moreover, splitting the first input tensors of an operator based on axis types generalizes better, and, using the axis types of the splittable axes and their positions on the operator's input and output tensors included in the operator's split information, a suitable operator split method can be chosen flexibly.
As a possible implementation, the processor is specifically configured to: determine, according to the target split method, the target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of target computing resources, to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors and the q-th group of second input tensors is the result of splitting the q-th of the Q first input tensors into K pieces, q = 1, ..., Q; and obtain the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
Here, the k-th of the K groups of input tensors includes the k-th second input tensor from each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
As a possible implementation, if the axis type of the target split axis in the first operator is the element axis or the sliding-window axis, and the axis type of the target split axis in the second operator is the element axis or the sliding-window axis, the processor is specifically configured to: determine, according to the first position information of the target split axis and the second position information of the target split axis, the L first output tensors of the first operator that include the target split axis and the position of the target split axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape-derivation function corresponding to the axis type of the target split axis in the first operator to obtain a third input length, the first input length being the length of the target split axis in each first input tensor, where the target split axis has equal length in every first input tensor; use the third input length as the input of the forward shape-derivation function corresponding to the axis type of the target split axis in the second operator to obtain a first output length; split the L first output tensors along the target split axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, each group of second output tensors including K second output tensors, where the l-th group of second output tensors is the result of splitting the l-th of the L first output tensors into K pieces; use the K second output lengths corresponding to the target split axis in each group of second output tensors as the input of the reverse derivation function corresponding to the axis type of the target split axis in the second operator, to obtain the K third input lengths corresponding to the target split axis in each group of Q groups of fifth input tensors, where the lengths of the target split axis in the k-th second output tensor of every group of second output tensors are equal, and the lengths of the target split axis in the k-th fifth input tensor of every group of fifth input tensors are equal; use the K third input lengths corresponding to the target split axis in each group of fifth input tensors as the input of the reverse derivation function corresponding to the axis type of the target split axis in the first operator, to obtain the K second input lengths corresponding to the target split axis in each group of the Q groups of second input tensors, where the lengths of the target split axis in the k-th second input tensor of every group of second input tensors are equal; and split each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each group of second input tensors, to obtain the Q groups of second input tensors.
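The forward and reverse length derivation chained through two sliding-window operators can be sketched as follows (kernel sizes, stride 1, and the function names are illustrative assumptions, not the claimed derivation functions): the final output length is derived forward through both operators, split into K pieces, and each piece's required input length is then derived backwards through both operators.

```python
# Forward / reverse shape derivation for a sliding-window axis of a
# "valid" convolution with stride 1; k is the window size.
def forward_len(in_len, k):
    return in_len - k + 1            # axis length on the output

def reverse_len(out_len, k):
    return out_len + k - 1           # input length needed for out_len outputs

# First operator then second operator, chained as in the scheme above.
k1, k2, in_len = 3, 3, 14
out_len = forward_len(forward_len(in_len, k1), k2)        # 14 -> 12 -> 10
piece_out = [out_len // 2, out_len - out_len // 2]        # K = 2 pieces
piece_in = [reverse_len(reverse_len(p, k2), k1) for p in piece_out]
assert out_len == 10 and piece_in == [9, 9]               # 5 -> 7 -> 9
```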
In this embodiment of the application, consecutive operator computations on the split input tensors are performed on the same target computing resource, which enables parallel computation across multiple target computing resources.
As a possible implementation, when the axis type of the target split axis in the first operator is the element axis or the sliding-window axis, the first position information of the target split axis further indicates the position of the target split axis in the output tensors of the first operator, and the processor is specifically configured to: determine, according to the first position information of the target split axis, the L first output tensors of the first operator that include the target split axis and the position of the target split axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use the first input length as the input of the forward shape-derivation function of the target split axis to obtain a first output length, the first input length being the length of the target split axis in each first input tensor, where the target split axis has equal length in every first input tensor; split the L first output tensors along the target split axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, each group of second output tensors including K second output tensors, where the l-th group of second output tensors is the result of splitting the l-th of the L first output tensors into K pieces; use the K second output lengths corresponding to the target split axis in each group of second output tensors as the input of the reverse derivation function of the target split axis, to obtain the K second input lengths corresponding to the target split axis in each group of the Q groups of second input tensors, where the lengths of the target split axis in the k-th second output tensor of every group of second output tensors are equal, and the lengths of the target split axis in the k-th second input tensor of every group of second input tensors are equal; and split each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each group of second input tensors, to obtain the Q groups of second input tensors.
As a possible implementation, when the axis type of the target split axis in the first operator is the element axis, the processor is specifically configured to: according to the K second input lengths corresponding to the target split axis in each group of the Q groups of second input tensors, invoke the first split function to split each of the Q first input tensors along the target split axis, obtaining the Q groups of second input tensors.
Here, in the q-th group of the Q groups of second input tensors, the elements corresponding to the target split axis in each second input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the second input tensors of the q-th group are pairwise disjoint; and the union of those elements equals the elements corresponding to the target split axis in the q-th first input tensor.
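For an element axis the split is the simplest case; a sketch (NumPy, with a pointwise ReLU standing in for the first operator; all names are illustrative): disjoint, exhaustive pieces whose per-resource results concatenate directly.

```python
import numpy as np

x = np.arange(-4.0, 4.0)             # every axis of a pointwise op is an element axis
full = np.maximum(x, 0.0)            # ReLU applied to the whole tensor

# First-split-function style: disjoint pieces with no intersection whose
# union is the whole axis; each resource applies the operator unchanged.
pieces = np.array_split(x, 3)
out = np.concatenate([np.maximum(p, 0.0) for p in pieces])
assert np.allclose(out, full)
```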
As a possible implementation, when the axis type of the target split axis in the first operator is the sliding-window axis, the processor is specifically configured to: according to the K second input lengths corresponding to the target split axis in each group of the Q groups of second input tensors, invoke the first slice function to split each of the Q first input tensors along the target split axis with overlap, obtaining the Q groups of second input tensors.
Here, in the q-th group of the Q groups of second input tensors, the elements corresponding to the target split axis in each second input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the second input tensors of the q-th group overlap with one another; and the union of the elements corresponding to the target split axis across the second input tensors of the q-th group equals the elements corresponding to the target split axis in the q-th first input tensor.
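A minimal sketch of the overlapping split for a sliding-window axis, assuming a window of size 3 with stride 1, so each piece carries a halo of window − 1 extra elements (the sizing is an illustrative choice):

```python
import numpy as np

def split_sliding_axis_overlap(x, k, window, axis=0):
    """Split x along `axis` into k pieces, each extended with the
    window-1 extra elements its local sliding windows need."""
    n = x.shape[axis]
    base = n // k
    pieces = []
    for i in range(k):
        start = i * base
        stop = min(n, start + base + window - 1)  # overlap with the next piece
        pieces.append(np.take(x, range(start, stop), axis=axis))
    return pieces

x = np.arange(10)
p = split_sliding_axis_overlap(x, k=2, window=3)
# Pieces overlap by window-1 = 2 elements; their union covers the whole axis.
```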
As a possible implementation, when the axis type of the target split axis in the first operator is a sliding window axis, the processor is specifically configured to: split each of the Q first input tensors along the target split axis by invoking a second split function, to obtain Q groups of third input tensors, each group of third input tensors comprising K third input tensors; slice the K third input tensors in each of the Q groups of third input tensors along the target split axis by invoking a second slice function, according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and concatenate, along the target split axis, the k-th fourth input tensor in the q-th group of fourth input tensors and the k-th third input tensor in the q-th group of third input tensors by invoking a concatenation function, to obtain the Q groups of second input tensors.
Here, in the q-th group of the Q groups of third input tensors, the elements corresponding to the target split axis in each third input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the third input tensors of the q-th group are pairwise disjoint; and the union of the elements corresponding to the target split axis across the third input tensors of the q-th group equals the elements corresponding to the target split axis in the q-th first input tensor.
Here, the elements corresponding to the target split axis in the k-th second input tensor of the q-th group of second input tensors are contiguous.
In the embodiments of this application, the overlap-free splitting of a sliding window axis is suitable for scenarios that require frequent data synchronization between different computing resources, for example multi-die parallelism, where the concatenation function serves as the data synchronization node between the dies. In this way, overlapping data is neither recomputed nor allowed to keep growing, which effectively relieves the computation and storage pressure on the computing resources.
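The overlap-free scheme above (second split function, second slice function, concatenation function) can be sketched as follows; the per-piece halo of window − 1 elements sliced from the right-hand neighbour is an illustrative assumption:

```python
import numpy as np

def split_with_halo_exchange(x, k, window, axis=0):
    """Split x into k disjoint pieces (the 'third input tensors'), then
    rebuild each working piece by concatenating it with the halo sliced
    from its right neighbour, mimicking a data-sync point between dies."""
    thirds = np.array_split(x, k, axis=axis)          # disjoint pieces
    halo = window - 1
    seconds = []
    for i, t in enumerate(thirds):
        if i + 1 < k:                                  # slice the neighbour's edge
            edge = np.take(thirds[i + 1], range(halo), axis=axis)
            t = np.concatenate([t, edge], axis=axis)   # concatenation function
        seconds.append(t)
    return seconds

x = np.arange(10)
pieces = split_with_halo_exchange(x, k=2, window=3)
# Same working pieces as the overlapping split, but the halo is exchanged
# explicitly rather than duplicated at split time.
```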
As a possible implementation, when the axis type of the target split axis in the first operator is a reduction axis, the processor is specifically configured to: split each of the Q first input tensors by invoking a third split function, according to the number K of target computing resources, to obtain the Q groups of second input tensors.
Here, in the q-th group of the Q groups of second input tensors, the elements corresponding to the target split axis in each second input tensor are a subset of the elements corresponding to the target split axis in the q-th of the Q first input tensors; the elements corresponding to the target split axis in the second input tensors of the q-th group are pairwise disjoint; and the union of the elements corresponding to the target split axis across the second input tensors of the q-th group equals the elements corresponding to the target split axis in the q-th first input tensor.
In the embodiments of this application, because the reduction axis type already determines the specific splitting method, the graph optimizer can reasonably split the input tensors of an operator containing a reduction axis during graph optimization without relying on the principle of the specific operator. In contrast, existing operator splitting methods all split from the output tensor of the specific operator; since a reduction axis characteristically either does not appear in the output tensor or has length 1 there, traditional operator splitting cannot split an axis of the input tensor that has the characteristics of a reduction axis.
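A sketch of why a reduce-sum axis can still be split from the input side even though it vanishes from the output: each resource reduces its own slice and the partial results are combined afterwards (NumPy, with K=2 chosen arbitrarily):

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)

# Full reduction over axis 1: the reduction axis does not appear in the output.
full = x.sum(axis=1)

# Split the reduction axis across K=2 resources, reduce locally, then combine.
slices = np.array_split(x, 2, axis=1)
partial = [s.sum(axis=1) for s in slices]
combined = np.add.reduce(partial)

assert np.allclose(combined, full)
```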
As a possible implementation, the reduction axes include a first type of reduction axis and a second type of reduction axis, where the first type of reduction axis is a reduction axis along which the operator performs a reduction operation on the elements of the operator's input tensor, and the second type of reduction axis is a reduction axis along which the operator does not perform a reduction operation on the elements of the operator's input tensor.
As a possible implementation, the first type of reduction axis includes any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, or a reduce-mean axis. The reduce-sum axis is a reduction axis along which the operator performs a summing reduction on the elements of its input tensor; the reduce-max axis is a reduction axis along which the operator performs a maximum reduction on the elements of its input tensor; the reduce-min axis is a reduction axis along which the operator performs a minimum reduction on the elements of its input tensor; and the reduce-mean axis is a reduction axis along which the operator performs an averaging reduction on the elements of its input tensor.
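Each first-type reduction axis dictates how partial per-slice results are recombined: partial sums add, partial maxima take the max of maxima, and partial means must be reweighted by slice length. A sketch under those assumptions:

```python
import numpy as np

x = np.random.default_rng(0).random((2, 7))
slices = np.array_split(x, 3, axis=1)          # uneven slice lengths: 3, 2, 2

# reduce-max: the max of per-slice maxima equals the global max.
assert np.allclose(np.maximum.reduce([s.max(axis=1) for s in slices]),
                   x.max(axis=1))

# reduce-mean: per-slice means must be weighted by slice length.
lengths = np.array([s.shape[1] for s in slices])
weighted = sum(s.mean(axis=1) * n for s, n in zip(slices, lengths))
assert np.allclose(weighted / lengths.sum(), x.mean(axis=1))
```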
As a possible implementation, the second type of reduction axis includes a gather axis, which is the axis along which the operator indexes data in the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
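A sketch of the gather axis: the elements of the index tensor address positions along one axis of the data input, so that axis can be split only if each index is routed to the slice that owns it and remapped to local coordinates (the two-way partition and offsets below are illustrative):

```python
import numpy as np

data = np.arange(10) * 10        # input tensor; axis 0 is the gather axis
idx = np.array([1, 7, 3, 8])     # index input tensor
full = data[idx]                 # unsplit gather

# Split the gather axis into two halves; route each index to its owner.
halves = np.array_split(data, 2)           # elements 0..4 and 5..9
out = np.empty_like(full)
for k, half in enumerate(halves):
    offset = k * 5
    mask = (idx >= offset) & (idx < offset + len(half))
    out[mask] = half[idx[mask] - offset]   # local re-indexing

assert np.array_equal(out, full)
```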
As a possible implementation, the target computing resource is one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
As a possible implementation, the apparatus may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor performs the method in any implementation of the first aspect.
According to a third aspect, a computer-readable medium is provided. The computer-readable medium stores program code, and the program code is used to execute the method in any implementation of the first aspect.
FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of this application;
FIG. 2 is a schematic diagram of operator splitting provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a method for processing a computing task provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of an operator splitting method for a computing task completed by a single operator, provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of another method for processing a computing task provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of an operator splitting method for a computing task completed by multiple operators, provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a splitting method for an element axis provided by an embodiment of this application;
FIG. 8 is a schematic diagram of a splitting method for a reduce-sum axis provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a splitting method for a reduce-max axis provided by an embodiment of this application;
FIG. 10 is a schematic diagram of a splitting method for a reduce-mean axis provided by an embodiment of this application;
FIG. 11 is a schematic diagram of a splitting method for a gather axis provided by an embodiment of this application;
FIG. 12 is a schematic diagram of a sliding window axis splitting method provided by an embodiment of this application;
FIG. 13 is a schematic diagram of another sliding window axis splitting method provided by an embodiment of this application;
FIG. 14 is a schematic diagram of the position information of an operator's splittable axes in the operator's input and output tensors, provided by an embodiment of this application;
FIG. 15 is a schematic diagram of a specific application of operator splitting provided by an embodiment of this application;
FIG. 16 is a schematic diagram of another specific application of operator splitting provided by an embodiment of this application;
FIG. 17 is a schematic diagram of yet another specific application of operator splitting provided by an embodiment of this application;
FIG. 18 is a schematic diagram of an operator tensor structure provided by an embodiment of this application;
FIG. 19 is a schematic diagram of an apparatus for processing a computing task provided by an embodiment of this application.
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
The terms used in the following embodiments are intended only to describe particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "said", "the above", "the", and "this" are intended to also cover expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of this application, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects.
Reference in this specification to "one embodiment", "some embodiments", and the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of this application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and so on, appearing in various places in this specification, do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
To facilitate understanding of the technical solutions of this application, the concepts involved are first briefly introduced.
(1) Deep learning model
A deep learning model is a machine learning model that contains a deep neural network structure. An algorithm engineer builds the model using a deep learning framework, tunes, trains, and optimizes it, and then saves the resulting network parameters together with the model structure; the file thus obtained is a model file that can be used for forward inference.
The formats of model files trained with different deep learning frameworks vary, but a complete model file generally contains information such as tensor data, operation units, and a computation graph.
(2) Tensor
A tensor is the data container of a deep learning system; it can be understood as the generalization of a matrix to an arbitrary number of dimensions. A tensor containing a single number is called a scalar, a scalar tensor, a zero-dimensional tensor, or a 0D tensor; an array of numbers is called a vector, a one-dimensional tensor, or a 1D tensor; an array of vectors is called a matrix, a two-dimensional tensor, or a 2D tensor; combining multiple matrices into a new piece of data yields a three-dimensional tensor, which can intuitively be understood as a cube of numbers; combining multiple three-dimensional tensors into an array creates a four-dimensional tensor, and so on. Deep learning generally works with 0D to 4D tensors, although 5D tensors may be encountered when processing video data.
The shape of a tensor indicates the number of elements in each of its dimensions. For example, [[[1,2,3]],[[7,8,9]]] is a three-dimensional tensor whose shape is (2,1,3): its 0-axis has size 2, its 1-axis has size 1, and its 2-axis has size 3. As another example, FIG. 18 is a schematic diagram of a tensor provided by an embodiment of this application; the tensor shown in FIG. 18 has shape (4,20,20,3). Assuming that this tensor represents a feature map, the physical meaning of the shape, read from left to right in FIG. 18, is: the batch size N of the feature map is 4, that is, 4 images; the height H of the feature map is 20 and its width W is 20, that is, each image has 20*20=400 pixels; and the feature map has 3 channels, namely the RGB channels.
The axes of a tensor are defined relative to its shape: an axis is an index into the shape. For example, [[[1,2],[3,4]],[[5,6],[7,8]]] is a three-dimensional tensor of shape (2,2,2); its 0-axis indexes the first-dimension data, the two matrices [[1,2],[3,4]] and [[5,6],[7,8]]; its 1-axis indexes the second-dimension data, [1,2], [3,4], [5,6], and [7,8]; and its 2-axis indexes the third-dimension data, the numbers 1, 2, 3, 4, 5, 6, 7, and 8. As another example, for the tensor of shape (4,20,20,3) shown in FIG. 18, the 0-axis carries the batch-size data of the feature map, the 1-axis the height data, the 2-axis the width data, and the 3-axis the channel data.
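The shape and axis conventions above can be checked directly in NumPy:

```python
import numpy as np

t = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
assert t.shape == (2, 2, 2)

# Axis 0 indexes the two matrices, axis 1 the row vectors, axis 2 the numbers.
assert np.array_equal(t[0], np.array([[1, 2], [3, 4]]))
assert np.array_equal(t[0, 1], np.array([3, 4]))
assert t[0, 1, 0] == 3

# The NHWC feature map of FIG. 18: batch 4, height 20, width 20, 3 channels.
fmap = np.zeros((4, 20, 20, 3))
assert fmap.shape[0] == 4 and fmap.shape[3] == 3
```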
(3) Operator
An operator (operation/operator), also called an operation unit, a computation unit, or an op, represents a symbolic computation process and is the basic unit of mainstream deep learning frameworks, that is, a node in the graph. The inputs and outputs of an operation unit are tensors. All the transformations learned by a deep network can be reduced to a handful of tensor operations on tensors of numeric data.
Common operation units include the add unit, the batch normalization (BatchNormalization) unit, the convolution unit, the gated recurrent unit (GRU), the local response normalization (LRN) unit, the long short-term memory (LSTM) unit, the max pooling (max pool) unit, the rectified linear unit (ReLU) activation function, the recurrent neural network (RNN) unit, the Softmax function, and so on.
(4) Computation graph
A computation graph, also called a dataflow graph, is defined as a directed acyclic graph (DAG). Tensors and operation units are both objects in the graph: operation units are the graph's nodes, and tensors are the data flowing along its edges. Acyclic means that the graph cannot contain cycles; for example, a tensor x cannot become an input to a layer that produced x. The only processing loops allowed (that is, recurrent connections) are the internal loops of recurrent layers.
Most deep learning frameworks can be described by a directed acyclic graph in which each node represents a neuron; if the output of one node serves as the input of another node, the two nodes share an edge. That is, the nodes in this computation graph represent operators, and an edge between two nodes represents a data dependency between them.
(5) Operator splitting
Operator splitting is the splitting of an operator's input tensors and output tensors.
FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of this application; the deep learning compiler is briefly introduced below with reference to FIG. 1.
A deep learning compiler can be divided into a compiler front end, a compiler middle end, and a compiler back end. The front end interfaces with the application layer, that is, with the deep learning model, and includes a parser and other components; the parser mainly converts models trained under different frameworks into an internal format that the hardware can recognize, for example, converting the computation graph of a framework such as TensorFlow or Caffe2 into a computation graph in an internally recognizable format. The middle end includes a graph optimizer (also called a graph optimization module) and operator information, among other things, and assigns different computing tasks to different computing resources (for example, CPUs and GPUs) for subsequent model execution. The back end mainly generates code instructions matched to different hardware automatically, and includes an operator compiler, an operator library, and so on.
A deep learning compiler typically improves a model's runtime performance on different devices at two levels: graph optimization and operator optimization. The two are relatively decoupled and independent. Graph optimization comprises general optimization strategies, that is, strategies independent of specific operator types, whereas operator optimization comprises operator-specific strategies, that is, strategies tied to specific operator types.
Typical operator optimization strategies include compute optimization and schedule optimization, which use manual or automatic tuning to push a single specific operator to peak performance on a specific hardware platform. For example, for a general matrix-matrix multiplication (GEMM) operator, manual scheduling techniques such as blocking, vectorization, loop permutation, data packing, and multi-core multi-threaded parallelism are typically applied to optimize the operator's schedule, yielding tens of times the performance for the GEMM operator on a CPU.
A typical graph optimization strategy is constant folding: if all the input tensors an operator depends on are constants, the operator node, which is independent of the model's runtime inputs, can be computed ahead of time at compile time, saving runtime overhead.
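A toy sketch of constant folding on graph nodes whose inputs are all constants; the node representation and helper below are illustrative assumptions, not the compiler's actual IR:

```python
import operator

def fold_constants(nodes, consts):
    """Evaluate, at 'compile time', any node all of whose inputs are
    already known constants; leave runtime-dependent nodes in place."""
    folded = dict(consts)
    remaining = []
    for name, op, inputs in nodes:
        if all(i in folded for i in inputs):
            folded[name] = op(*(folded[i] for i in inputs))  # precompute
        else:
            remaining.append((name, op, inputs))
    return folded, remaining

nodes = [("a", operator.add, ("c1", "c2")),    # both inputs constant -> folded
         ("b", operator.mul, ("a", "x"))]      # depends on runtime input x
folded, remaining = fold_constants(nodes, {"c1": 2, "c2": 3})
assert folded["a"] == 5 and len(remaining) == 1
```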
At present there are many other graph optimization strategies, for example, graph partitioning and execution order optimization, multi-die parallelism, multi-thread parallelism, and multi-chip parallelism. All of these strategies need to be based on the principle of the operator itself; without that basis, the parallel optimization strategy cannot be expressed in the computation graph.
For example, graph partitioning and execution order optimization is a graph optimization strategy that lowers the memory limits on operator execution. Specifically, by uniformly splitting the outer-loop iteration variables of operators and, on that basis, adjusting the subsequent execution order, operators can perform a large number of iterations locally, reducing their memory requirements; more of the intermediate results produced by local computation can be kept in the L2 cache, lowering the memory limit for the operators' subsequent runs and ultimately improving the runtime performance of the overall network model. Hence the way operators are split is particularly important.
Another example is multi-die parallelism. A die is an unpackaged chip; chips use advanced packaging technology to accumulate computing power. To fully exploit a chip's performance, an operator can be split across multiple dies for computation, minimizing data interaction between the dies. Hence, again, the way operators are split is particularly important.
Yet another example is multi-thread parallelism, which takes a subgraph, that is, a subgraph comprising different operators, as a basic computation unit. When a subgraph is assigned to multiple threads to run in parallel, its operators must be split; that is, when a subgraph is distributed across different computing resources for parallel execution, for example on different CPUs, the operators in the subgraph need to be split. Because one round of data synchronization between threads marks the end of a subgraph's execution, interaction between threads should also be minimized when running in parallel. Hence the splitting of the operators in a subgraph is likewise particularly important.
FIG. 2 is a schematic diagram of operator splitting provided by an embodiment of this application. The tensors of a single operator can be split into different slices, and different slices can run on different threads, dies, or chips. For example, as shown in FIG. 2, the input tensor of operator 1 is split into slice 1 and slice 2 of operator 1, and the input tensor of operator 2 is split into slice 1 and slice 2 of operator 2. Slice 1 of operator 1 runs on computing resource 1 and passes its result to computing resource 2, where slice 1 of operator 2 runs; at the same time, slice 2 of operator 1 runs on computing resource 3 and passes its result to computing resource 4, where slice 2 of operator 2 runs. Finally, the results on computing resource 2 and computing resource 4 are concatenated.
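The flow of FIG. 2 can be sketched as a split-compute-concatenate pipeline; here threads stand in for the four computing resources, and the two operators are illustrative element-wise functions rather than anything prescribed by the embodiment:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

op1 = lambda t: t + 1          # operator 1 (illustrative)
op2 = lambda t: t * 2          # operator 2 (illustrative)

x = np.arange(8)
slice1, slice2 = np.array_split(x, 2)          # slices of operator 1's input

with ThreadPoolExecutor(max_workers=2) as pool:
    # Each slice flows through op1 then op2 on its own "computing resource".
    results = list(pool.map(lambda s: op2(op1(s)), [slice1, slice2]))

y = np.concatenate(results)                    # final concatenation
assert np.array_equal(y, op2(op1(x)))          # matches the unsplit computation
```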
Because most current graph optimization schemes related to operator splitting must be based on the principle of the operator itself (for example, graph optimization must split operators according to the properties of their loop iteration variables), while the architecture also requires operator optimization and graph optimization to be relatively decoupled, a current practice is for algorithm engineers to manually classify the loop variables of the necessary operators and summarize the changes produced after each class of loop variable is split along an axis, thereby supporting the efficient generation of graph optimization strategies. At present, this efficient generation is inseparable from the manual classification of the loop variables of the necessary operators, which prevents arbitrary operators from being automatically split and run.
Currently, there is an approach that splits an operator's input tensors at the application layer based on splittable axes of the operator's output (for example, a sample axis, a parameter axis, and an attribute axis), so that operator splitting achieves parallel computation across multiple GPUs.
Specifically, the sample axis, parameter axis, and attribute axis are three kinds of splittable axes on the operator's output. Splitting the samples of the operator's input tensor along the sample axis, that is, splitting the input tensor in the sample dimension, assigns the sample-axis slices to different computing resources for data parallelism. Splitting the parameters of the operator's input tensor along the parameter axis, that is, splitting the input tensor in the parameter dimension, assigns the parameter-axis slices to different computing resources for model parallelism. The attribute axes are the axes in the operator's output other than the sample and parameter axes; splitting the operator's input tensor along an attribute axis of the samples means splitting it in the attribute dimension.
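A sketch of the sample-axis and parameter-axis splits for a matrix multiplication Y = X @ W, where the 0-axis of X plays the sample axis and the 1-axis of W plays the parameter axis (the two-way partition is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((6, 4))   # samples x features
W = rng.random((4, 8))   # features x output parameters
Y = X @ W

# Data parallelism: split the sample axis of X; each resource holds all of W.
xs = np.array_split(X, 2, axis=0)
assert np.allclose(np.concatenate([x @ W for x in xs], axis=0), Y)

# Model parallelism: split the parameter axis of W; each resource holds all of X.
ws = np.array_split(W, 2, axis=1)
assert np.allclose(np.concatenate([X @ w for w in ws], axis=1), Y)
```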
根据这三种轴可以将算子输入张量切分到不同的计算资源上进行运算,可以根据三种轴单独进行切分,也可以根据三种轴的组合进行混合切分,实现多计算资源的并行运算的效果。虽然目前这种切分方式可以实现一定程度的应用层的自动算子切分,但是仍然有一定的局限性。首先,目前只针对矩阵乘算子根据输出张量中的轴定义了三种维度的轴来进行算子切分,无法覆盖算子所有的可切分轴以及切分方式。其次,目前这三种轴的定义是根据算子输出张量的轴的类型确定的,也就是如果算子输出张量中没有的轴,不会从算子输入张量可切分轴的实际情况进行切分。这将导致算子输入张量的切分比较粗糙,无法对算子切分进行精准切分之后分配到不同的计算资源上进行运算;最后,此方法仍然是在应用层定义切分轴以及切分方式,也就是算法工程师在应用层通过脚本语言根据某一种算子类型中包括的切分轴来确定切分方式,这样依旧无法实现针对不同算子的输入输出的自动切分,并且无法实现图优化和算子优化的完全解耦。According to these three axes, the operator input tensor can be divided into different computing resources for calculation. It can be divided according to the three axes alone, or mixed and divided according to the combination of the three axes to achieve multiple computing resources. The effect of parallel operation. Although the current segmentation method can achieve a certain degree of automatic operator segmentation at the application layer, it still has certain limitations. First of all, currently only for the matrix multiplication operator, three dimensions of axes are defined according to the axes in the output tensor to perform operator segmentation, which cannot cover all the slicable axes and slicing methods of the operator. Secondly, the definition of these three axes is determined according to the type of the axis of the operator output tensor, that is, if there is no axis in the operator output tensor, it will not be divided from the actual situation of the operator input tensor. Segmentation. This will result in a rough segmentation of the operator input tensor, and it is impossible to accurately segment the operator segmentation and then allocate it to different computing resources for calculation; finally, this method still defines the segmentation axis and segmentation at the application layer In other words, the algorithm engineer uses the script language to determine the segmentation method based on the segmentation axis included in a certain operator type at the application layer. 
In this way, it is still impossible to realize the automatic segmentation of the input and output of different operators, and it is impossible to Realize the complete decoupling of graph optimization and operator optimization.
To solve the above problems, the embodiments of this application propose a method and an apparatus for processing a computing task, which are described in detail below with reference to FIG. 3 to FIG. 19.

FIG. 3 is a schematic flowchart of a method for processing a computing task provided by an embodiment of this application.
S301: Determine a first operator for executing the computing task, where the first operator includes N splittable axes, and N is a positive integer greater than or equal to 1.

It should be understood that the N splittable axes included in the first operator mean that the input tensors of the first operator include N splittable axes.

S302: Obtain segmentation information of the first operator from an operator segmentation information library, where the segmentation information of the first operator includes, for the n-th splittable axis among the N splittable axes, the axis type of that axis in the first operator and first position information, where the first position information indicates the position of the n-th splittable axis in the input tensors of the first operator, and n = 1, ..., N.

That is, the segmentation information of the first operator indicates that each of the N splittable axes has its own corresponding axis type in the first operator; the different axis types and the segmentation methods corresponding to them are described in detail below with reference to FIG. 7 to FIG. 17. The segmentation information of the first operator also indicates in which input tensors of the first operator each splittable axis appears, and on which axis of those input tensors it appears. For example, according to the position information of splittable axis 1 in the first operator, it can be known that splittable axis 1 appears in input tensors 1 and 2 of the first operator, on axis 0 of input tensor 1 and on axis 0 of input tensor 2.
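The per-axis segmentation information described above (an axis type plus position information mapping input tensors to dimensions) can be sketched as a small record. This is a minimal illustrative sketch, not the embodiment's actual data layout; all names here are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class AxisType(Enum):
    """Axis types named later in this description (FIG. 7 to FIG. 13)."""
    ELEMENTWISE = "elementwise"
    REDUCE = "reduce"
    SLIDING_WINDOW = "sliding_window"

@dataclass
class SplittableAxis:
    """Segmentation information for one splittable axis of an operator."""
    axis_type: AxisType
    # Position information: input-tensor index -> dimension index of this axis.
    positions: dict

# Example from the text: splittable axis 1 appears on axis 0 of
# input tensors 1 and 2 of the first operator.
axis1 = SplittableAxis(AxisType.ELEMENTWISE, positions={1: 0, 2: 0})

# The operator segmentation information library could then map each
# operator to the list of its splittable axes.
op_split_info = {"OperatorA": [axis1]}
```

With such a record, a graph optimizer can look up segmentation information without knowing anything about an operator's mathematical semantics.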
S303: Segment the input tensors of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2.

It should be understood that the number of input tensors included in each of the K groups of input tensors is the same as the number of input tensors of the first operator.

As a possible implementation, the input tensors of the first operator are segmented according to the segmentation information of the first operator and a quantity M of computing resources to obtain the K groups of input tensors, where M is a positive integer greater than or equal to 2.

Although the quantity of available computing resources is M, the graph optimizer does not necessarily need to use all of them. For example, the required quantity K of target computing resources may be estimated according to the size of the computing task, or may be determined randomly; this is not limited in the embodiments of this application.

It should also be understood that each of the K groups of input tensors is the set of input tensors required by one target computing resource. For example, if before the input tensors of the first operator are segmented, a single computing resource executing the computing task requires a input tensors, then after segmentation each target computing resource executing the computing task likewise requires a input tensors.

As a possible implementation, a target split axis is determined according to the segmentation information of the first operator, and the input tensors of the first operator are segmented according to the target split axis to obtain the K groups of input tensors. The operator segmentation procedure for completing a computing task with a single operator is described in detail below with reference to FIG. 4.

It should be noted that segmenting the input tensors of the first operator does not mean segmenting all of them: only the input tensors that include the target split axis are segmented, while the input tensors that do not include the target split axis are sent to every target computing resource as shared input data.

As a possible implementation, if a second operator is also required to execute the computing task, a candidate segmentation space is determined according to the segmentation information of the first operator, the segmentation information of the second operator, and the quantity M of computing resources; a target segmentation method is determined from the candidate segmentation space; and the input tensors of the first operator are segmented according to the target segmentation method to obtain the K groups of input tensors. The segmentation procedure for completing a computing task with multiple operators is described in detail below with reference to FIG. 5.

S304: Send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.

It should be understood that the quantity K of target computing resources is determined according to the quantity M of computing resources.

In the embodiments of this application, the graph optimizer obtains the segmentation information of each operator from the operator segmentation information library. Because the segmentation information of every operator can be obtained directly from the library, the graph optimizer can automatically segment the input tensors of an operator without perceiving the mathematical semantics or underlying implementation of the operator at all, thereby achieving complete decoupling of graph optimization and operator optimization.
FIG. 4 is a schematic flowchart of an operator segmentation procedure for completing a computing task with a single operator, provided by an embodiment of this application. FIG. 4 is a specific description of one possible implementation of S303.

S401: Determine a target split axis, where the target split axis is one of the N splittable axes.

As a possible implementation, the graph optimizer randomly selects one splittable axis as the target split axis; for example, the first axis of the input tensors of the first operator, which may be a batch axis, is used as the target split axis.

As a possible implementation, the graph optimizer selects, as the target split axis, the splittable axis shared by the largest number of input tensors of the first operator. For example, if the first operator has three input tensors, splittable axis 1 appears in all three input tensors, and splittable axis 2 appears in only two of them, then splittable axis 1 may be used as the target split axis.
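The "most common axis" selection rule just described can be sketched directly from the position information. A minimal sketch; the dictionary layout is illustrative, not from the embodiment.

```python
def pick_target_axis(axis_positions):
    """Pick the splittable axis that appears in the most input tensors.

    axis_positions maps an axis name to {input-tensor index: dimension
    index}, mirroring the position information in the segmentation
    information of the operator.
    """
    return max(axis_positions, key=lambda axis: len(axis_positions[axis]))

# Axis 1 appears in 3 input tensors, axis 2 in only 2,
# so axis 1 is chosen as the target split axis.
positions = {
    "axis1": {0: 0, 1: 0, 2: 1},
    "axis2": {0: 1, 2: 0},
}
print(pick_target_axis(positions))  # axis1
```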
As a possible implementation, the splittable axis whose corresponding segmentation method completes the computing task in the shortest computation time is selected as the target split axis.

As a possible implementation, the target split axis is determined according to both the computation time required to complete the computing task under each splittable axis's segmentation method and the quantity K of target computing resources. For example, if completing the computing task with the segmentation method corresponding to splittable axis 1 and b target computing resources takes the same computation time as with the segmentation method corresponding to splittable axis 2 and c target computing resources, but the quantity b of target computing resources corresponding to splittable axis 1 is smaller than the quantity c corresponding to splittable axis 2, then splittable axis 1 is selected as the target split axis, with b target computing resources.

S402: Determine, according to the segmentation information of the first operator, the segmentation method corresponding to the axis type of the target split axis in the first operator.

S403: Segment the input tensors of the first operator according to the segmentation method corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors.

As a possible implementation, according to the segmentation method corresponding to the axis type of the target split axis in the first operator, Q first input tensors of the first operator that include the target split axis are determined, together with the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.

It should be understood that the Q first input tensors are the input tensors of the first operator that include the target split axis.

It should be understood that the position of the target split axis in each of the Q first input tensors indicates on which axis of that first input tensor the target split axis lies; for example, the target split axis may lie on axis 0 of the first of the first input tensors and on axis 0 of the second of the first input tensors.

As a possible implementation, according to the axis type of the target split axis in the first operator and the quantity K of target computing resources, each of the Q first input tensors is segmented to obtain Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors.

It should be understood that the q-th group of second input tensors among the Q groups is the result of segmenting the q-th first input tensor among the Q first input tensors into K pieces, where q = 1, ..., Q.

It should be understood that when the quantity of target computing resources is K, every first input tensor that includes the target split axis is segmented along the target split axis into K second input tensors, which respectively serve as input tensors of the K target computing resources. Therefore, when there are Q first input tensors that include the target split axis, Q groups of second input tensors are formed.

It should be understood that the axis type of the target split axis may be an elementwise axis, a sliding window axis, or a reduce axis; the segmentation methods for these axis types are described in detail below with reference to FIG. 7 to FIG. 13.
As a possible implementation, the K groups of input tensors are obtained according to the Q groups of second input tensors and the unsegmented input tensors of the first operator.

The k-th group among the K groups of input tensors includes the k-th second input tensor from each of the Q groups of second input tensors, together with the unsegmented input tensors of the first operator.

It should be understood that each of the K groups of input tensors includes, as shared data, the unsegmented input tensors of the first operator, together with the segmented second input tensors of the first operator corresponding to that target computing resource.
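The assembly of the K groups described above can be sketched with NumPy: tensors containing the target split axis are segmented into K pieces, while the remaining tensors are shared unsegmented by every group. A sketch; function and variable names are illustrative, not from the embodiment.

```python
import numpy as np

def build_input_groups(inputs, split_positions, K):
    """Assemble K groups of input tensors for K target computing resources.

    inputs: list of all input tensors of the operator.
    split_positions: {input index: dimension of the target split axis};
    inputs not listed here do not contain the target split axis and are
    sent to every target computing resource as shared input data.
    """
    groups = [[] for _ in range(K)]
    for i, tensor in enumerate(inputs):
        if i in split_positions:
            chunks = np.array_split(tensor, K, axis=split_positions[i])
            for k in range(K):
                groups[k].append(chunks[k])
        else:  # shared, unsegmented input tensor
            for k in range(K):
                groups[k].append(tensor)
    return groups

x = np.ones((8, 4))   # contains the target split axis on axis 0
w = np.ones((4, 4))   # does not contain the target split axis
groups = build_input_groups([x, w], {0: 0}, K=2)
print(groups[0][0].shape, groups[0][1].shape)  # (4, 4) (4, 4)
```

Note that each group ends up with the same number of tensors as the original operator input list, matching the statement above S303.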
FIG. 5 is a schematic flowchart of another method for processing a computing task provided by an embodiment of this application. FIG. 5 is a specific description of another possible implementation of S303.

When the operators for executing the computing task further include a second operator, the second operator includes P splittable axes, and the P splittable axes are a subset of the N splittable axes.

S501: Obtain segmentation information of the second operator from the operator segmentation information library, where the segmentation information of the second operator includes, for the p-th splittable axis among the P splittable axes, the axis type of that axis in the second operator and second position information, where the second position information indicates the position of the p-th splittable axis in the input tensors of the second operator, and the input tensors of the second operator are the output tensors of the first operator, where P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, ..., P.

It should be understood that the P splittable axes being a subset of the N splittable axes means that the P splittable axes appear in the output tensors of the first operator, and the output tensors of the first operator serve as the input tensors of the second operator. That is, the P splittable axes of the second operator also appear among the N splittable axes of the first operator.

S502: Determine P pieces of segmentation reference information according to the segmentation information of the first operator and the segmentation information of the second operator, where the p-th piece of segmentation reference information among the P pieces includes: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensors of the first operator.

S503: Determine P groups of candidate segmentation methods according to the P pieces of segmentation reference information and the quantity M of computing resources, where the p-th group among the P groups of candidate segmentation methods includes at least one segmentation method.

The segmentation methods included in the p-th group of candidate segmentation methods are determined according to the p-th piece of segmentation reference information among the P pieces and the quantity M of computing resources.

It should be understood that each group of candidate segmentation methods corresponds to one piece of segmentation reference information, that is, to one of the P splittable axes. That each group includes at least one segmentation method can also be understood as each group including M-1 segmentation methods. For example, when the quantity of computing resources is 4, the quantity of target computing resources may be 2, 3, or 4, that is, there are 3 possible quantities of target computing resources; therefore, each group of candidate segmentation methods includes 3 segmentation methods.
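The candidate space just described (M-1 segmentation methods per splittable axis, one for each target-resource count K from 2 to M) can be enumerated directly. A sketch; the (axis, K) tuple layout is an illustrative assumption.

```python
def candidate_space(splittable_axes, M):
    """Enumerate the candidate segmentation space.

    Each splittable axis contributes M-1 candidate segmentation methods,
    one per target computing resource count K in 2..M.
    """
    return [(axis, K) for axis in splittable_axes for K in range(2, M + 1)]

# With M = 4 computing resources, each axis yields 3 candidates (K = 2, 3, 4),
# so two splittable axes yield 6 candidates in total.
space = candidate_space(["axis1", "axis2"], M=4)
print(len(space))  # 6
```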
S504: Determine a target segmentation method according to the time each segmentation method in the P groups of candidate segmentation methods needs to complete the computing task.

As a possible implementation, the segmentation method in the P groups of candidate segmentation methods that needs the shortest time to complete the computing task is determined as the target segmentation method.

Specifically, when the total number of segmentation methods in the P groups of candidate segmentation methods is small, the P groups are traversed to obtain the time each candidate segmentation method needs to complete the computing task, and the segmentation method needing the shortest time is selected as the target segmentation method. The traversal may be performed through simulation, theoretical calculation, or running on actual hardware; the traversal manner is not limited in the embodiments of this application.

Specifically, when the total number of segmentation methods in the P groups of candidate segmentation methods is large, the target segmentation method is searched for among the P groups of candidate segmentation methods. Many search methods are possible, for example a Markov chain Monte Carlo algorithm or a genetic algorithm; the search method is not limited in the embodiments of this application.

As a possible implementation, the target segmentation method is determined according to both the time each segmentation method in the P groups of candidate segmentation methods needs to complete the computing task and the quantity K of target computing resources. For example, if completing the computing task with segmentation method 1 and d target computing resources takes the same computation time as with segmentation method 2 and e target computing resources, but the quantity d of target computing resources corresponding to segmentation method 1 is smaller than the quantity e corresponding to segmentation method 2, then segmentation method 1 is selected as the target segmentation method, with d target computing resources.
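The selection rule above (shortest completion time, with ties broken in favor of fewer target computing resources) can be sketched as a single minimization. The cost table here is hypothetical; real times would come from simulation, theoretical calculation, or actual hardware runs as stated above.

```python
def pick_target(candidates, cost):
    """Pick the target segmentation method from the candidate space.

    candidates: iterable of (axis, K) pairs, where K is the quantity of
    target computing resources. cost(axis, K) returns the time that
    segmentation needs to complete the computing task. Ties in time are
    broken in favor of a smaller K.
    """
    return min(candidates, key=lambda c: (cost(*c), c[1]))

# Hypothetical measured times: two candidates take the same time, so the
# one using fewer target computing resources wins.
times = {("axis1", 2): 5.0, ("axis1", 4): 5.0, ("axis2", 3): 7.0}
best = pick_target(times.keys(), lambda axis, k: times[(axis, k)])
print(best)  # ('axis1', 2)
```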
S505: Segment the input tensors of the first operator according to the target segmentation method to obtain the K groups of input tensors.

FIG. 6 is a schematic flowchart of an operator segmentation procedure for completing a computing task with multiple operators, provided by an embodiment of this application. S505 is described in detail below with reference to FIG. 6.

S601: Determine, according to the target segmentation method, the target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.

It should be understood that the explanation of the Q first input tensors in S601 is similar to that in S402; for brevity, reference may be made to the description in S402, and details are not repeated here.

S602: According to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the quantity K of target computing resources, segment each of the Q first input tensors to obtain Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors among the Q groups is the result of segmenting the q-th first input tensor among the Q first input tensors into K pieces, where q = 1, ..., Q.

It should be understood that the explanation of the Q groups of second input tensors in S602 is similar to that in S403; for brevity, reference may be made to the description in S403, and details are not repeated here.

It should be noted that obtaining the Q groups of second input tensors in S602 must be based on both the axis type of the target split axis in the first operator and its axis type in the second operator; the specific segmentation method is illustrated below with reference to FIG. 15.

S603: Obtain the K groups of input tensors according to the Q groups of second input tensors and the unsegmented input tensors of the first operator.

The k-th group among the K groups of input tensors includes the k-th second input tensor from each of the Q groups of second input tensors, together with the unsegmented input tensors of the first operator.

It should be understood that the explanation of the K groups of input tensors in S603 is similar to that in S404; for brevity, reference may be made to the description in S404, and details are not repeated here.

It should be noted that completing the computing task may involve more operators; the embodiments of this application take the case in which completing the computing task involves the first operator and the second operator as an example for detailed description. When operators other than the first operator and the second operator are also required to complete the computing task, the graph optimizer further needs to obtain the segmentation information of those other operators so as to obtain the candidate segmentation methods and determine the target segmentation method.
The axis types of operator input tensors in the embodiments of this application, the position information of the splittable axes in the input tensors and output tensors of an operator, and the operator segmentation methods corresponding to the different axis types are described in detail below with reference to FIG. 7 to FIG. 17.

The axis type describes the data dependency between the input tensors and the output of an operator; that is, the graph optimizer can determine the segmentation method corresponding to an axis according to its axis type. Therefore, when the inputs of different operators include axes of the same type, the same operator segmentation method can be applied.

As a possible implementation, the axis types of operator input tensors may include splittable axes such as the elementwise axis, the reduce axis, and the sliding window axis, and may also include other types of splittable axes; this is not limited in the embodiments of this application.

The elementwise axis, the reduce axis, and the sliding window axis are described in detail below with reference to FIG. 7 to FIG. 13. It should be noted that FIG. 7 to FIG. 13 are all schematic diagrams of operator segmentation methods for completing a computing task with a single operator; operator A, operator B, and operator C in FIG. 7 to FIG. 13 may each represent the first operator, and the name of the first operator is not limited in the embodiments of this application.

Elementwise axis: if an iteration variable of the input tensors of operator A is an elementwise axis, the elementwise axis is an axis along which the elements of operator A's input tensors and output tensor have a point-to-point mapping; that is, a point in the output tensor and the point in the input tensor on which it depends have the same position on that axis. For example, suppose the input tensor is a four-dimensional tensor of shape (5, 7, 9, 3), where axis 3 of the input tensor has length 3 and contains the data a0, a1, and a2, and the output tensor has shape (4, 6, 8, 3), where axis 3 of the output tensor has length 3 and contains the data b0, b1, and b2. If the positions of a0 and b0 correspond, the positions of a1 and b1 correspond, and the positions of a2 and b2 correspond, then the axis type of axis 3 of the input tensor and the output tensor is the elementwise axis.

FIG. 7 is a schematic diagram of a segmentation method for an elementwise axis provided by an embodiment of this application. The steps of segmenting the input tensor of operator A along the elementwise axis are shown in FIG. 7. In FIG. 7, operator A is an activation function operator by way of example; the type of operator A is not limited in the embodiments of this application. It should be noted that the activation function operator in FIG. 7 is described with a single input tensor and a single output tensor as an example; the numbers of input tensors and output tensors of an operator are not limited in the embodiments of this application.
Specifically, the type of the target split axis in the activation function operator is the elementwise axis. According to the position information of the target split axis in the activation function operator, it can be determined that the target split axis, whose type is the elementwise axis, appears on axis 0 of the first input tensor of the activation function operator; that is, the axis 0 of length 8 is the elementwise axis. According to the length of the elementwise axis of the first input tensor, the length of the elementwise axis of the first output tensor is obtained through the forward shape derivation function y = f1(x) between the elementwise axes of the first input tensor and the first output tensor, where x represents the length of the elementwise axis of the first input tensor and y represents the length of the elementwise axis of the first output tensor. The forward derivation logic of the elementwise axis is that the elementwise axes of the first input tensor and the first output tensor have equal lengths. As shown in (a) of FIG. 7, the first input tensor of the activation function operator is (8, 56, 56, 64) and axis 0 of the first input tensor is the elementwise axis; according to the logic that the elementwise axis of the output tensor has the same length as the elementwise axis of the input tensor, the length of axis 0 of the first output tensor is also 8, that is, the first output tensor is (8, 56, 56, 64).
Based on the number of target computing resources, the first output tensor is split along the elementwise axis to obtain the second output tensor of the activation-function operator on each target computing resource. Then, based on the length of the elementwise axis of the second output tensor of the operator on each computing resource, the length of the elementwise axis of each second input tensor is deduced through the reverse shape derivation function x = f1⁻¹(y) of the elementwise axis. When splitting the first input tensor along the elementwise axis, a split function needs to be used; after the operations on the different computing resources are completed, the second output tensors on the different target computing resources need to be combined through a concatenation (concat) function to obtain the first output tensor.
As shown in (b) of FIG. 7, there are two target computing resources for the activation-function operator operation, and the length of axis 0 of the first output tensor is 8, so the length of the elementwise axis of the second output tensor on each computing resource is 4; that is, the first second output tensor is (4, 56, 56, 64) and the second second output tensor is (4, 56, 56, 64), and the concatenation function is used between the second output tensors and the first output tensor to synchronize the data. The elements on axis 0 of the first second output tensor and the elements on axis 0 of the second second output tensor have no intersection. Then, according to the reverse shape derivation function of the elementwise axis, the length of the elementwise axis of each second input tensor is deduced to be 4; that is, the first second input tensor is (4, 56, 56, 64) and the second second input tensor is (4, 56, 56, 64). Therefore, to obtain the shapes of the second input tensors, the graph optimizer splits the first input tensor along the elementwise axis, that is, along axis 0 of the first input tensor, by calling the first split function, to obtain two second input tensors.
It should be noted that the lengths of axis 0 of the two second output tensors shown in (b) of FIG. 7 are equal, and the lengths of axis 0 of the two second input tensors are also equal. The second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements corresponding to axis 0 of the two second input tensors do not overlap, that is, the elements corresponding to axis 0 of each second input tensor are a subset of the elements corresponding to axis 0 of the first input tensor, with no intersection between them; in addition, the union of the elements on axis 0 of the two second input tensors is the set of elements on axis 0 of the first input tensor. This embodiment of this application does not limit whether the lengths corresponding to axis 0 of the second input tensors obtained on different computing resources are equal.
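The split → per-resource operator → concat flow for an elementwise axis described above can be sketched in NumPy as follows. This is a minimal illustration only: the function name `elementwise_split` and the use of ReLU as a stand-in for the activation-function operator are assumptions of this sketch, not part of this application.

```python
import numpy as np

def elementwise_split(x, op, num_devices, axis=0):
    """Split the input tensor along an elementwise axis, apply the
    operator on each piece independently (simulating the target
    computing resources), then concatenate the partial outputs.
    Because an elementwise axis maps input points to output points
    one-to-one, the result equals op(x)."""
    pieces = np.split(x, num_devices, axis=axis)   # the split function
    partial = [op(p) for p in pieces]              # per-resource operation
    return np.concatenate(partial, axis=axis)      # the concat function

relu = lambda t: np.maximum(t, 0)                  # stand-in activation op
x = np.random.randn(8, 56, 56, 64)                 # first input tensor
y = elementwise_split(x, relu, num_devices=2)      # reassembled first output
assert y.shape == (8, 56, 56, 64)
assert np.array_equal(y, relu(x))                  # equivalent to unsplit op
```

Each second input tensor here is (4, 56, 56, 64), matching the example in (b) of FIG. 7.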
Reduce axis: if an iteration variable in the input tensor of operator B is a reduce axis, then the reduce axis is an axis that exists in the input tensor of the operator but either does not exist in the output tensor of the operator or has a length of 1 there.
Specifically, reduce axes can be further divided into two types. The first type of reduce axis is a reduce axis along which operator B performs a reduction operation on the elements of the input tensor. For example, if the shape of the input tensor of operator B is (2, 3, 4, 5), where axis 0 of the input tensor is the reduce axis and has a length of 2, then after the input tensor is processed by operator B, the shape of the resulting output tensor is (, 3, 4, 5) or (1, 3, 4, 5).
The second type of reduce axis is a reduce axis along which operator B does not perform a reduction operation on the elements of the input tensor. Although operator B does not reduce the elements on a reduce axis of the second type, such an axis likewise appears in the input tensor but not in the output tensor. An example is the reduce-gather axis, which will be described in detail with reference to FIG. 11.
The first type of reduce axis may include a reduce-sum (reduceSum) axis, a reduce-maximum (reduceMax) axis, a reduce-minimum (reduceMin) axis, a reduce-mean (reduceMean) axis, and the like. It should be noted that these different types of reduce axes all share the general characteristics of a reduce axis; the difference lies in the types of functions that need to be called so that, after the split first input tensors pass through operator B on the different target computing resources, a first output tensor equivalent to that of the unsplit operation is obtained. The specific splitting manners for the different types of first-type reduce axes are described below with reference to FIG. 8 to FIG. 10.
FIG. 8 is a schematic diagram of a splitting manner for a reduce-sum axis according to an embodiment of this application. The steps of splitting the reduce-sum axis of the first input tensor of operator B are shown in FIG. 8. In FIG. 8, operator B is described by taking a reduce-sum operator as an example; this embodiment of this application does not limit the type of operator B. It should be noted that the input and output of the reduce-sum operator in FIG. 8 are described by taking a single input tensor and a single output tensor as an example; this embodiment of this application does not limit the number of input tensors and output tensors of the operator.
Specifically, the type of the target splitting axis in the reduce-sum operator is the reduce-sum axis. According to the position information of the target splitting axis in the reduce-sum operator, it can be determined that the target splitting axis of the reduce-sum type appears on axis 0 of the first input tensor of the reduce-sum operator. As shown in (a) of FIG. 8, the first input tensor of the reduce-sum operator is (8, 56, 56, 64), and the axis type of axis 0 of the first input tensor is the reduce-sum axis, that is, axis 0 of length 8 is the reduce-sum axis. Therefore, according to the characteristics of the reduce-sum axis, the length of the reduce-sum axis of the first output tensor is 1, that is, the first output tensor is (, 56, 56, 64).
Based on the number of target computing resources and the length of the reduce-sum axis of the first output tensor, the first input tensor is split along the reduce-sum axis into two second input tensors by calling the third split function, and the two second input tensors are sent to the two target computing resources for operation to obtain two second output tensors, where the first second output tensor is (, 56, 56, 64) and the second second output tensor is (, 56, 56, 64); the data on the two target computing resources are then synchronized by calling an addition (AddN) function to obtain the first output tensor. As shown in (b) of FIG. 8, there are two available computing resources for the reduce-sum operator operation, and the length of the reduce-sum axis of the first output tensor is 1; the second output tensors of the operator on the computing resources are combined through the addition operator to obtain the first output tensor, and the shape of each second output tensor is the same as that of the first output tensor, both being (, 56, 56, 64). Since there are two computing resources, the reduce-sum axis of the first input tensor is split by the split operator to obtain the second input tensors, where the length of the reduce-sum axis of each second input tensor is 4, that is, the shape of each second input tensor is (4, 56, 56, 64).
It should be noted that the lengths of axis 0 of the two second output tensors shown in (b) of FIG. 8 are equal, and the lengths of axis 0 of the two second input tensors are also equal. The second input tensors corresponding to different target computing resources only need to satisfy the following conditions: the elements corresponding to axis 0 of the two second input tensors do not overlap, that is, the elements corresponding to axis 0 of each second input tensor are a subset of the elements corresponding to axis 0 of the first input tensor, with no intersection between them; in addition, the union of the elements on axis 0 of the two second input tensors is the set of elements on axis 0 of the first input tensor. This embodiment of this application does not limit whether the lengths corresponding to axis 0 of the second input tensors obtained on different computing resources are equal.
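The reduce-sum splitting above (split along the reduce axis, compute partial sums per resource, synchronize with an AddN-style addition) can be sketched in NumPy. The variable names are illustrative assumptions of this sketch, not part of this application.

```python
import numpy as np

x = np.random.randn(8, 56, 56, 64)                 # first input tensor
full = x.sum(axis=0)                               # unsplit reduceSum: (56, 56, 64)

pieces = np.split(x, 2, axis=0)                    # third split function: two (4, 56, 56, 64)
partials = [p.sum(axis=0) for p in pieces]         # per-resource second output tensors
combined = np.add(partials[0], partials[1])        # AddN-style synchronization
assert np.allclose(combined, full)                 # equivalent to the unsplit operation
```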
FIG. 9 is a schematic diagram of a splitting manner for a reduce-maximum axis according to an embodiment of this application. The steps of splitting the reduce-maximum axis of the first input tensor of operator B are shown in FIG. 9. In FIG. 9, operator B is described by taking a reduce-maximum operator as an example; this embodiment of this application does not limit the type of operator B. It should be noted that the input and output of the reduce-maximum operator in FIG. 9 are described by taking a single input tensor and a single output tensor as an example; this embodiment of this application does not limit the number of input tensors and output tensors of the operator.
For a first input tensor that includes a reduce-maximum axis, the main steps of splitting the first input tensor of the reduce-maximum operator are generally the same as those of splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in FIG. 8. Reference may be made to the description of the steps of splitting the first input tensor along the reduce-sum axis in FIG. 8, and details are not repeated here.
It should be noted that, as shown in FIG. 9, the type of function that needs to be called after the first input tensor is split along the reduce-sum axis and passes through operator B differs from that needed after the first input tensor is split along the reduce-maximum axis and passes through operator B. For the second output tensors obtained by performing the reduce-sum operator operation on the second input tensors on the different target computing resources, the data are synchronized by calling the addition function to obtain the first output tensor; for the second output tensors obtained by performing the reduce-maximum operator operation on the second input tensors on the different computing resources, the data are synchronized by calling the maximum function to obtain the first output tensor.
For a first input tensor that includes a reduce-minimum axis, the main steps of splitting the first input tensor of the reduce-minimum operator along the reduce-minimum axis are generally the same as those of splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in FIG. 8. Reference may be made to the description of the steps of splitting the first input tensor along the reduce-sum axis in FIG. 8, and details are not repeated here.
It should be noted that the type of function that needs to be called after the first input tensor is split along the reduce-sum axis and passes through operator B differs from that needed after the first input tensor is split along the reduce-minimum axis and passes through operator B. For the second output tensors obtained by performing the reduce-sum operator operation on the second input tensors on the different target computing resources, the data are synchronized by calling the addition function to obtain the first output tensor; for the second output tensors obtained by performing the reduce-minimum operator operation on the second input tensors on the different computing resources, the data are synchronized by calling the minimum function to obtain the first output tensor.
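The only change relative to the reduce-sum case is the synchronization function: elementwise maximum for reduceMax and elementwise minimum for reduceMin. A NumPy sketch (variable names are illustrative assumptions):

```python
import numpy as np

x = np.random.randn(8, 56, 56, 64)                        # first input tensor
pieces = np.split(x, 2, axis=0)                           # two second input tensors

# reduceMax: per-resource partial maxima, synchronized with the maximum function
max_out = np.maximum(pieces[0].max(axis=0), pieces[1].max(axis=0))
assert np.array_equal(max_out, x.max(axis=0))

# reduceMin: per-resource partial minima, synchronized with the minimum function
min_out = np.minimum(pieces[0].min(axis=0), pieces[1].min(axis=0))
assert np.array_equal(min_out, x.min(axis=0))
```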
FIG. 10 is a schematic diagram of a splitting manner for a reduce-mean axis according to an embodiment of this application. The steps of splitting the input tensor of operator B along the reduce-mean axis are shown in FIG. 10. In FIG. 10, operator B is described by taking a reduce-mean operator as an example; this embodiment of this application does not limit the type of operator B. It should be noted that the input and output of the reduce-mean operator in FIG. 10 are described by taking a single input tensor and a single output tensor as an example; this embodiment of this application does not limit the number of input tensors and output tensors of the operator.
For a first input tensor that includes a reduce-mean axis, the main steps of splitting the first input tensor of the reduce-mean operator are generally the same as those of splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in FIG. 8. Reference may be made to the description of the steps of splitting the first input tensor along the reduce-sum axis in FIG. 8, and details are not repeated here.
It should be noted that, as shown in FIG. 10, the number of functions that need to be called after the first input tensor is split along the reduce-mean axis and passes through operator B differs from that needed after the first input tensor is split along the reduce-sum axis and passes through operator B. For the second output tensors obtained by performing the reduce-mean operator operation on the second input tensors on the different target computing resources, the data are synchronized by calling the addition function to obtain an intermediate output tensor, and a multiplication function additionally needs to be called to obtain the first output tensor. Here, the addition function is a synchronization node that sums the reduce-mean axes of the second output tensors of the different computing resources, and the multiplication function multiplies the reduce-mean axis of the summed and synchronized intermediate output tensor by 1/group to obtain the first output tensor, where group is the number of target computing resources; for example, in FIG. 10, the number of target computing resources is 2, so group is 2.
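The two-step synchronization for reduceMean (AddN over the partial means, then multiplication by 1/group) can be sketched in NumPy. This assumes the reduce-mean axis is split into equal-length slices, as in the example in FIG. 10; the variable names are illustrative.

```python
import numpy as np

group = 2                                           # number of target computing resources
x = np.random.randn(8, 56, 56, 64)                  # first input tensor
pieces = np.split(x, group, axis=0)                 # equal-length second input tensors
partial_means = [p.mean(axis=0) for p in pieces]    # per-resource reduceMean outputs
synced = np.add(partial_means[0], partial_means[1]) # AddN synchronization node
result = synced * (1.0 / group)                     # multiplication node: scale by 1/group
assert np.allclose(result, x.mean(axis=0))          # equivalent to the unsplit reduceMean
```

Note that the final scaling by 1/group is only exact because the slices have equal length, which is the case in the figure's example.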
The second type of reduce axis includes a reduce-gather axis, that is, an axis along which operator B indexes data among the elements of its input tensor according to the addresses indicated by the elements of its index input tensor. In other words, when the first input tensor contains a reduce-gather axis, the corresponding data need to be found in the reduce-gather axis of the first input tensor, according to the addresses in the first index input tensor (indices), as the data of axis 0 of the first output tensor. FIG. 11 shows a splitting manner for a reduce-gather axis according to an embodiment of this application, described in detail by taking operator B as a gather (gather2) operator as an example; this embodiment of this application does not limit the type of operator B. It should be noted that the input of the gather operator in FIG. 11 is described by taking one index input tensor and one first input tensor as an example, and the output of the gather operator is described by taking one first output tensor as an example; this embodiment of this application does not limit the number of first input tensors and first output tensors of the operator.
Specifically, as shown in (a) of FIG. 11, the gather operator has two input tensors, namely the first input tensor and the first index input tensor, where the first input tensor is the data input tensor, with shape (80, 64), and the first index input tensor is the input tensor containing the index addresses, with shape (20,). According to the splitting information of the gather operator, the target splitting axis is determined to be the reduce-gather axis, and the reduce-gather axis appears on axis 0 of the first input tensor. According to the characteristics of the reduce-gather axis, the data of axis 0 of the first output tensor are the data elements found in axis 0 of the first input tensor according to the index addresses in the first index input tensor; therefore, the shape of the first output tensor is (20, 64).
Based on the number of target computing resources and the length of the reduce-gather axis of the first input tensor, the first input tensor is split along the reduce-gather axis by calling the third split function to obtain two second input tensors. Each target computing resource has its corresponding second input tensor, as well as an index input tensor obtained by offsetting the first index input tensor through a call to an offset function. Each target computing resource then performs the gather operator operation to obtain its own second output tensor, and the second output tensors on the different target computing resources are added and synchronized by calling the addition function to obtain the first output tensor. As shown in (b) of FIG. 11, there are two target computing resources for the gather operator operation, the first output tensor has no reduce-gather axis, and the length of axis 0 of the first output tensor equals the length of axis 0 of the first index input tensor. Since there are two computing resources, the reduce-gather axis of the first input tensor is split by calling the third split function to obtain the second input tensors, where the length of the reduce-gather axis of each second input tensor is 40; that is, the shape of the first second input tensor is (40, 64), and the shape of the second second input tensor is also (40, 64). When performing the gather operator operation, each computing resource obtains the same first index input tensor. Since the gather operator on each computing resource obtains only half of the data on the reduce-gather axis of the first input tensor, namely the first second input tensor or the second second input tensor, the first index input tensor needs to pass through the offset operator operation to ensure the correctness of the second output tensor produced by the gather operator operation on each computing resource.
It should be noted that, when the gather operator operation is performed, because the first input tensor is divided into two parts, when the gather operator on each computing resource searches the reduce-gather axis of its second input tensor according to the addresses in the first index input tensor to obtain the data of axis 0 of its second output tensor, the searched data may not exist; in this case, 0 is taken as the search result. Finally, the second output tensors produced by the gather operator operations on the two computing resources are combined through the addition operator operation to obtain the first output tensor.
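The sharded-gather scheme above (split the data tensor along the reduce-gather axis, offset the indices per shard, return 0 for out-of-range addresses, then add the partial results) can be sketched in NumPy for a 2-D data tensor. The function name `sharded_gather` and the helper logic are assumptions of this sketch, not part of this application.

```python
import numpy as np

def sharded_gather(data, indices, num_shards):
    """Emulate the split gather: each shard holds a contiguous slice of
    the reduce-gather axis, offsets the shared index tensor into its
    local coordinates, contributes 0 for indices it does not hold, and
    the partial results are combined with an AddN-style addition."""
    shards = np.split(data, num_shards, axis=0)     # third split function
    shard_len = shards[0].shape[0]
    out = np.zeros((indices.shape[0], data.shape[1]), data.dtype)
    for i, shard in enumerate(shards):
        local = indices - i * shard_len             # offset (bias) the indices
        hit = (local >= 0) & (local < shard_len)    # addresses held by this shard
        safe = np.clip(local, 0, shard_len - 1)     # avoid out-of-range lookup
        partial = np.where(hit[:, None], shard[safe], 0)  # misses contribute 0
        out = out + partial                         # addition synchronization
    return out

data = np.random.randn(80, 64)                      # first input tensor
idx = np.random.randint(0, 80, size=(20,))          # first index input tensor
assert np.array_equal(sharded_gather(data, idx, 2), data[idx])
```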
In this embodiment of this application, because the type of the reduce axis already determines the specific splitting manner, during graph optimization the graph optimizer can reasonably split the input tensor of an operator that includes a reduce axis without relying on the principle of the specific operator. In contrast, current operator splitting manners all start from the output tensor of the specific operator; because the characteristic of a reduce axis is that it does not appear in the output tensor, or has a length of 1 there, traditional operator splitting manners cannot split an axis of the input tensor that has the characteristics of a reduce axis.
Sliding window axis: if an iteration variable in the input tensor of operator C is a sliding window axis, then the sliding window axis is an axis along which operator C performs a sliding-window scanning operation on the elements of its input tensor. If the sliding window is larger than the stride, the windows of every two adjacent scans overlap.
If the first output tensor is split along the sliding window axis, and, with two target computing resources, the elements corresponding to the sliding window axis of the first output tensor are divided equally, then some data on the sliding window axes of the divided first output tensors depend on the same data on the sliding window axis of the first input tensor. Therefore, there are two splitting manners for a first input tensor that contains a sliding window axis, which are described in detail with reference to FIG. 12 and FIG. 13.
It should be noted that the forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor is y = f2(x); that is, the length of the sliding window axis of the first output tensor is deduced forward from the length of the sliding window axis of the first input tensor, where x represents the length of the sliding window axis of the first input tensor and y represents the length of the sliding window axis of the first output tensor. f2() is related to the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
The reverse shape derivation function of the sliding window axes of the first input tensor and the first output tensor is x = f2⁻¹(y); that is, reverse derivation is performed according to the length of the sliding window axis of the first output tensor to determine an appropriate splitting manner, so as to obtain the second output tensor and the second input tensor of each computing resource. f2⁻¹() is likewise related to the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
FIG. 12 is a schematic diagram of a sliding-window-axis splitting manner according to an embodiment of this application. The manner of splitting the input tensor of operator C with overlap along the sliding window axis is shown in FIG. 12. In FIG. 12, operator C is described by taking a convolution operator as an example; this embodiment of this application does not limit the type of operator C. It should be noted that the input and output of the convolution operator in FIG. 12 are described by taking a single input tensor and a single output tensor as an example; this embodiment of this application does not limit the number of input tensors and output tensors of the operator.
Specifically, the type of the target splitting axis in the convolution operator is the sliding window axis. According to the position information of the target splitting axis in the convolution operator, it can be determined that the target splitting axis of the sliding window type appears on axis 1 of the first input tensor of the convolution operator, that is, axis 1 of length 56 is the sliding window axis. Therefore, the length of the sliding window axis of the first output tensor is deduced forward from the forward shape derivation function of the sliding window axis and the length of the sliding window axis of the first input tensor of the convolution operator. As shown in (a) of FIG. 12, the first input tensor of the operator is (1, 56, 56, 64); according to the forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor, where the convolution stride is 2 and the convolution kernel size is 3, the first output tensor is obtained as (1, 28, 56, 64).
Based on the number K of target computing resources and the length of the sliding window axis of the first output tensor, the first output tensor is split along the sliding window axis to obtain K second output tensors. Then, the length of the sliding window axis of the second input tensor of the convolution operator on each target computing resource is deduced through the reverse shape derivation function of the sliding window axis. According to the length of the sliding window axis of each second output tensor, the first input tensor can be sliced along the sliding window axis by calling the first slice function to obtain the second input tensors. After the second output tensors are obtained through the operations on the different target computing resources, a first output tensor equivalent to that of the unsplit operation can be obtained by calling the concatenation function.
如图12的(b)所示,有两个目标计算资源用于卷积算子运算,第一输出张量的1轴为滑动窗口轴,长度为28,因此在通过调用拼接函数之前的每个计算资源上的第二输出张量1轴的长度为14。随后根据滑动窗口轴反向形状推导逻辑,由于卷积步长为2,卷积核大小为3,因此,每个计算资源的第二输入张量的长度为29。随后,通过调用第一切片函数1和第一切片函数2,对第一输入张量按照1轴进行切片,得到两个1轴长度为29的第二输入张量,其中,这两个1轴长度为29的第二输入张量上有重叠的数据,其中一个第二输入张量1轴的数据范围是第一输入张量1轴中从0到28,另一个第二输入张量1轴的数据范围是第一输入张量1轴中从28到56,第一输入张量1轴中第29个数据是两个第二输入张量的重叠部分。As shown in (b) of Figure 12, there are two target computing resources for the convolution operator operation, the first output tensor axis 1 is the sliding window axis, and the length is 28, so each The length of the second output tensor 1-axis on computing resources is 14. Then, the logic is derived according to the reverse shape of the sliding window axis. Since the convolution step size is 2 and the convolution kernel size is 3, the length of the second input tensor of each computing resource is 29. Subsequently, by calling the first slice function 1 and the first slice function 2, the first input tensor is sliced according to the 1-axis, and two second input tensors with a 1-axis length of 29 are obtained, wherein the two There is overlapping data on the second input tensor with a length of 29 on the 1 axis, and the data range of the 1 axis of one of the second input tensors is from 0 to 28 in the 1 axis of the first input tensor, and the other second input tensor The data range of axis 1 is from 28 to 56 in axis 1 of the first input tensor, and the 29th data in axis 1 of the first input tensor is the overlapping part of the two second input tensors.
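The reverse shape derivation used above can be sketched as follows (no padding is assumed inside a slice, which reproduces the 14 → 29 step for stride 2 and kernel 3; the function name is illustrative):

```python
def sliding_window_reverse(out_len, kernel, stride):
    # input length a computing resource needs in order to produce
    # out_len output positions along the sliding-window axis
    return (out_len - 1) * stride + kernel

print(sliding_window_reverse(14, 3, 2))  # 29 input rows per resource
```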
图12所示的带重叠的切分方式，The overlapping split shown in Figure 12 suits scenarios in which the split input tensors, after the operator runs on different computing resources, do not require frequent data synchronization, for example, multi-thread parallelism, where the threads are completely independent and can run in a pipelined, parallel fashion. In some scenarios, however, the split input tensors require frequent data synchronization after computation on the different computing resources; the overlapping parts of the output tensors obtained on the different computing resources are then concatenated repeatedly and keep growing, causing unnecessary duplicated computation. Therefore, the embodiments of this application also provide a non-overlapping split of the sliding-window axis, shown in Figure 13.
Figure 13 is a schematic diagram of another sliding-window-axis split provided by an embodiment of this application. The steps of splitting the sliding-window axis of the input tensor of operator C without overlap are shown in Figure 13, where operator C is a convolution operator used as an example. It should be noted that, as in Figure 12, the convolution operator in Figure 13 is described with a single input tensor and a single output tensor as an example; the embodiments of this application do not limit the number of input tensors and output tensors of an operator.
Specifically, in Figure 13 the step of deriving the sliding-window-axis length of the second input tensor on each target computing resource is the same as in the overlapping split of Figure 12; refer to the description of Figure 12 for details. Figure 13 differs from Figure 12 in how the first input tensor is split to obtain the K second input tensors.
具体地，Specifically, in (b) of Figure 13, the second split function divides the first input tensor equally along the sliding-window axis. Second slice function 1 and second slice function 2 obtain the overlapping part of the sliding-window axis of the second input tensors on which the sliding-window-axis data in the second output tensors computed on the different target computing resources jointly depend. First concatenation function 1 and first concatenation function 2 concatenate, along the sliding-window axis, the third input tensors and fourth input tensors produced by the second split function, second slice function 1, and second slice function 2 to obtain the second input tensor for each target computing resource. The second concatenation function concatenates the second output tensors computed by the convolution operator on the different target computing resources to obtain the first output tensor.
Specifically, as shown in (b) of Figure 13, there are two target computing resources, computing resource 1 and computing resource 2. The shape of the first output tensor is (1, 28, 56, 64), and axis 1 of the first input tensor is the sliding-window axis.
By calling the second split function, the first input tensor is split along axis 1 into two equal third input tensors, each of shape (1, 28, 56, 64): the 1st third input tensor and the 2nd third input tensor. By calling second slice function 1, the 1st third input tensor is sliced along axis 1 to obtain the 2nd fourth input tensor, of shape (1, 1, 56, 64); its axis-1 data is the last element of the 1st third input tensor along the sliding-window axis, that is, the 28th element of axis 1 of the first input tensor. By calling second slice function 2, the 2nd third input tensor is sliced along axis 1 to obtain the 1st fourth input tensor, of shape (1, 1, 56, 64); its axis-1 data is the first element of the 2nd third input tensor along the sliding-window axis, that is, the 29th element of axis 1 of the first input tensor.
By calling first concatenation function 1, the 1st third input tensor and the 1st fourth input tensor are concatenated along axis 1 to obtain the 1st second input tensor, of shape (1, 29, 56, 64); its axis-1 data range is positions 0 to 28 of axis 1 of the first input tensor. Similarly, by calling first concatenation function 2, the 2nd third input tensor and the 2nd fourth input tensor are concatenated along axis 1 to obtain the 2nd second input tensor, of shape (1, 29, 56, 64); its axis-1 data range is positions 28 to 56 of axis 1 of the first input tensor.
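The equal split plus boundary exchange of Figure 13(b) can be sketched on a single axis as follows (plain Python lists stand in for the tensors along the sliding-window axis; k = 2 and a halo width of 1 follow the example):

```python
def split_no_overlap(axis_data, k=2, halo=1):
    # second split function: k equal "third" pieces along the sliding-window axis
    n = len(axis_data) // k
    thirds = [axis_data[i * n:(i + 1) * n] for i in range(k)]
    second_inputs = []
    for i in range(k):
        # second slice functions: one boundary element ("fourth" piece)
        # taken from each neighbouring third piece
        left = thirds[i - 1][-halo:] if i > 0 else []
        right = thirds[i + 1][:halo] if i + 1 < k else []
        # first concatenation functions: third + fourth pieces form the
        # second input tensor for this computing resource
        second_inputs.append(left + thirds[i] + right)
    return second_inputs

rows = list(range(56))
a, b = split_no_overlap(rows)
print(len(a), len(b))  # 29 29, matching shapes (1, 29, 56, 64)
```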
在本申请实施例中，In the embodiments of this application, the non-overlapping split of the sliding-window axis suits scenarios requiring frequent data synchronization between different computing resources, for example, multi-die parallelism, where the concatenation function serves as the data-synchronization node between different dies. In this way, overlapping data is not computed repeatedly and does not keep growing, which effectively relieves the computation and storage pressure on the computing resources.
In the embodiments of this application, the graph optimizer performs single-operator splitting according to the types of the axes in the operator's input tensors and the split method corresponding to each axis type. The graph optimizer can thus automatically obtain different single-operator split strategies without relying on the principle of the specific operator, thereby achieving complete decoupling of the graph optimizer and the operator optimization module.
上述内容，The foregoing describes the different axis types and their corresponding split methods in detail; the different axis types can be represented by the following data structure:
It should be noted that the tensor-axis types are not limited to those listed in the embodiments of this application; there may be other tensor axes and corresponding operator split methods, which the embodiments of this application do not limit.
It should be noted that a computing resource may be a GPU, a CPU, a die, a chip, or the like. The embodiments of this application limit neither the type nor the number of computing resources; the two computing resources in the embodiments are merely an example.
在本申请实施例中，In the embodiments of this application, the graph optimizer automatically splits the operator's input and output tensors according to the different axis types. The graph optimizer does not need to split the input and output tensors based on the principle of a specific operator; it only needs to split them according to the split method corresponding to each axis type. For the operator, splitting its input and output tensors does not change the operator's computation formula before or after the split; only some of the operator's parameters change. This achieves thorough decoupling of graph optimization from specific operator principles, and the method of splitting the operator's first input tensor based on the different axis types therefore generalizes more strongly.
The foregoing describes the split of a single operator's input tensors and output tensors. The graph optimizer of the embodiments of this application can determine the split method from the axis types of a single operator's input tensors and the position information of the target split axis in the single operator's input and output tensors. When multiple operators are required to complete a computing task, the position information of the splittable axes in the input and output tensors of the multiple operators makes it possible for the graph optimizer to cascade the split methods of the different operators into a subgraph. The position information of an operator's splittable axes in its input and output tensors is described below with reference to Figure 14.
图14，Figure 14 is a schematic diagram, provided by an embodiment of this application, of the position information of an operator's splittable axes in the operator's input tensors and output tensors.
The position information of an operator's splittable axes in the input and output tensors indicates on which input tensors and which output tensors the same splittable axis appears, as well as the specific position of that splittable axis in each input tensor and output tensor. The type of each splittable axis is one of the axis types described above.
It should be understood that the embodiments of this application do not limit the number of input tensors and output tensors of an operator; multiple input tensors may yield multiple output tensors after the operator runs. Likewise, the embodiments of this application do not limit the number of first tensor axes in the input and output tensors.
如图14所示，As shown in Figure 14, taking a convolution operator as an example, the convolution operator has two input tensors: a first feature-map input tensor and a first weight input tensor, of shapes (8, 56, 56, 64) and (4, 3, 3, 64) respectively. Axes 0, 1, and 2 of the first feature-map input tensor are sliding-window axes and axis 3 is a reduction axis; axis 0 of the first weight input tensor is an element axis and axes 1, 2, and 3 are reduction axes. Because reduction axes do not appear in the output tensor, the tensor axes appearing in the first output tensor are axes 0, 1, and 2 of the first feature-map input tensor and axis 0 of the first weight input tensor; therefore, the shape of the first output tensor is (8, 56, 56, 4).
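The shape composition described above can be sketched as follows (the axis-type tags and the unchanged spatial sizes are assumptions specific to this example):

```python
# hypothetical axis-type tags; names are illustrative, not from the patent
SLIDING, REDUCE, ELEMENT = "sliding_window", "reduction", "element"

feature_shape = (8, 56, 56, 64)
feature_axes = (SLIDING, SLIDING, SLIDING, REDUCE)
weight_shape = (4, 3, 3, 64)
weight_axes = (ELEMENT, REDUCE, REDUCE, REDUCE)

# reduction axes never appear in the output; the sliding-window axes keep
# their input lengths in this example, and the element axis of the weight
# tensor supplies the output channel count
out_shape = tuple(s for s, t in zip(feature_shape, feature_axes) if t != REDUCE) \
          + tuple(s for s, t in zip(weight_shape, weight_axes) if t != REDUCE)
print(out_shape)  # (8, 56, 56, 4)
```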
以上结合图，The foregoing, with reference to the figure, describes the position information of the splittable axes in the operator's input and output tensors. Two concrete data structures for this position information are given below: a data structure centered on the splittable axes, and a data structure centered on the input and output tensors.
As one possible implementation, the data structure centered on the splittable axes includes the type of each splittable axis, the input tensors in which the splittable axis appears, and the position at which the splittable axis appears in each input tensor and output tensor:
Specifically, taking the addition operator as an example, one input tensor has shape (3, 1, 5) and the other has shape (3, 4, 1); after the addition operator, the output tensor is (3, 4, 5). The data structure centered on the splittable axes can be expressed as:
As one possible implementation, the data structure centered on the input and output tensors includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the axis type corresponding to each number:
input_dim_name_defs: vector<vector<int>>  // the number of each axis in each input
output_dim_name_defs: vector<vector<int>>  // the number of each axis in each output
dim_slice_types: map<int, AXIS_TYPE>  // the axis type corresponding to each number
具体地，Specifically, taking the addition operator as an example, one input tensor has shape (3, 1, 5) and the other has shape (3, 4, 1); after the addition operator, the output tensor is (3, 4, 5). The data structure centered on the tensor axes can be expressed as:
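The original listing is not reproduced in this text. As an illustrative sketch only, the declared fields might be populated for the add example like this (the axis numbering, in particular that broadcast-aligned axes share one number, is an assumption):

```python
# hypothetical instance of the tensor-centred structure for
# add((3,1,5), (3,4,1)) -> (3,4,5); values are illustrative
input_dim_name_defs = [[0, 1, 2],   # input x, shape (3, 1, 5)
                       [0, 1, 2]]   # input y, shape (3, 4, 1)
output_dim_name_defs = [[0, 1, 2]]  # output,  shape (3, 4, 5)
dim_slice_types = {0: "element", 1: "element", 2: "element"}
```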
图优化器，The graph optimizer cascades the splits of different operators into a subgraph according to the axis types in the operators' input tensors and the position information of each axis in the input and output tensors. Different axis position information allows different applications. Specific applications of the operator-splitting method for processing computing tasks in the embodiments of this application are described in detail below with reference to Figures 15 to 17.
Figure 15 is a schematic diagram of a specific application of operator splitting provided by an embodiment of this application. Scenario 1: an output tensor of a first operator that contains the splittable axis serves as an input tensor of a second operator, that is, split optimization of the input tensors of multiple consecutive operators.
As one possible implementation, the split method of the first input tensors is determined according to the axis type of the target split axis in the different operators and the position information of the target split axis in the first input tensors and first output tensors of the different operators.
具体的，Specifically, there are two activation-function operators in Figure 15: a ReLU operator and a TanH operator. The graph optimizer obtains the split information of the ReLU operator and the split information of the TanH operator. In the target split method determined by the graph optimizer, the target split axis is of element-axis type. For the ReLU operator, the element axis appears on axis 0 of the first input tensor and axis 0 of the first output tensor of the ReLU operator; the first input tensor has shape (8, 56, 56, 64), and by the split method corresponding to the element axis, the first output tensor also has shape (8, 56, 56, 64). For the TanH operator, the element axis likewise appears on axis 0 of its first input tensor and axis 0 of its first output tensor; the first input tensor has shape (8, 56, 56, 64), so by the same element-axis split method the first output tensor also has shape (8, 56, 56, 64).
If the position information of the element axis in the input and output tensors of the ReLU and TanH operators were unknown, the second output tensors produced on each target computing resource after the ReLU operator would first have to be concatenated and synchronized to obtain the first input tensor of the TanH operator, and that first input tensor would then be split again to obtain the second input tensors for the different target computing resources. As shown in (a) of Figure 15, completing the ReLU and TanH operations would then require two calls to the split function and two calls to the concatenation function.
Because the graph optimizer knows the position information of the target split axis in the first input tensors and first output tensors of the different operators, when the ReLU and TanH operators run consecutively, the intermediate concatenation operator and intermediate split operator generated by splitting the tensors of the ReLU and TanH operators can be omitted. As shown in (b) of Figure 15, the element axis appears on axis 0 of the input and output tensors of both the ReLU and TanH operators, so only one split operator node and one concatenation operator node are needed to run the ReLU and TanH operators consecutively.
具体的，Specifically, according to the element-axis split method described above, a single call to the split function splits the first input tensor along the element axis into two equal second input tensors for the ReLU operator. On each computing resource, the ReLU operator produces its second output tensor, which serves as the second input tensor of the TanH operator; the TanH operator then produces its second output tensor on each target computing resource, and finally one concatenation operation yields the final first output tensor.
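The split-once, concatenate-once pipeline above can be sketched on a 1-D stand-in for the element axis (a minimal sketch; the operator implementations are the usual elementwise definitions, not code from the patent):

```python
import math

def relu(xs):
    return [max(v, 0.0) for v in xs]

def tanh_op(xs):
    return [math.tanh(v) for v in xs]

data = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]  # stands in for axis 0 of (8,56,56,64)
k = 2
n = len(data) // k
pieces = [data[i * n:(i + 1) * n] for i in range(k)]  # one split call
# both elementwise operators run back-to-back on each resource;
# no intermediate concat/split between ReLU and TanH is needed
results = [tanh_op(relu(p)) for p in pieces]
fused = results[0] + results[1]                       # one concat call
assert fused == tanh_op(relu(data))                   # equivalent to the unsplit run
```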
需要说明的是，It should be noted that the embodiments of this application do not limit the axis type of the target split axis in the consecutive operators. Here, the target split axis has the same type, namely the element axis, in both the first operator and the second operator for illustration; the axis types of the target split axis in consecutive operators may be the same or different, which the embodiments of this application do not limit.
In the embodiments of this application, the split input tensors undergo consecutive operator operations on the same target computing resource, which enables parallel computation across multiple target computing resources.
Figure 16 is a schematic diagram of another specific application of operator splitting provided by an embodiment of this application. Scenario 2: the splittable axis appears on multiple input tensors and a single output tensor of a single operator.
作为一种可能的实现方式，As one possible implementation, the split method of the input tensors of the first operator is determined according to the split information of the first operator and the number K of target computing resources.
Specifically, taking the addition operator as an example, the addition operator has two first input tensors: the 1st first input tensor x, of shape (m, n), and the 2nd first input tensor y, of shape (m,).
The split information of the addition operator includes: splittable axis 1 is of element-axis type and appears on axis 0 of the first input tensor x and axis 0 of the first input tensor y, with length m; splittable axis 2 is of element-axis type and appears on axis 1 of the first input tensor x, with length n.
According to the split information of the first operator, two operator split methods can be determined: the first splits the input tensors that contain splittable axis 1 of length m, and the second splits the input tensor that contains splittable axis 2 of length n.
As shown in (a) of Figure 16, the input tensors containing splittable axis 1 of length m are split. Splittable axis 1 of length m is determined as the target split axis; according to the position information of splittable axis 1 in the input tensors of the addition operator, the first input tensor x will be split along axis 0 and the first input tensor y will be split along axis 0. By the split method for splittable axis 1 as an element axis, the first input tensor x is divided equally along axis 0 into two second input tensors x0 and x1, and the first input tensor y is divided equally along axis 0 into two second input tensors y0 and y1, which are sent to the two target computing resources for the addition operation to obtain the second output tensors; the concatenation function is then called to obtain the first output tensor.
如图16的（b）所示，As shown in (b) of Figure 16, the input tensor containing splittable axis 2 of length n is split. Splittable axis 2 of length n is determined as the target split axis; according to the position information of splittable axis 2 in the input tensors of the addition operator, the first input tensor x will be split along axis 1. By the split method for splittable axis 2 as an element axis, the first input tensor x is divided equally along axis 1 into two second input tensors x0' and x1'. Because splittable axis 2 does not appear in the first input tensor y, the first input tensor y is sent to the different target computing resources as shared data. The addition operation is then performed on each target computing resource to obtain the second output tensors, and the concatenation function is called to obtain the first output tensor.
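Both split strategies for the broadcast add can be sketched as follows (nested lists stand in for x of shape (m, n) and y of shape (m,); small m and n are chosen only to keep the example readable):

```python
def add_op(x, y_col):
    # x: m x n nested list; y_col: length-m vector broadcast over columns
    return [[v + y_col[i] for v in row] for i, row in enumerate(x)]

m, n = 4, 6
x = [[float(i * n + j) for j in range(n)] for i in range(m)]
y = [10.0 * i for i in range(m)]

# strategy 1: target split axis is splittable axis 1 (length m);
# both x and y are split along tensor axis 0
x0, x1 = x[:m // 2], x[m // 2:]
y0, y1 = y[:m // 2], y[m // 2:]
out_a = add_op(x0, y0) + add_op(x1, y1)       # concat along axis 0

# strategy 2: target split axis is splittable axis 2 (length n);
# only x is split along tensor axis 1, y is shared by both resources
xl = [row[:n // 2] for row in x]
xr = [row[n // 2:] for row in x]
out_l, out_r = add_op(xl, y), add_op(xr, y)
out_b = [l + r for l, r in zip(out_l, out_r)]  # concat along axis 1

assert out_a == out_b == add_op(x, y)          # both equal the unsplit result
```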
作为一种可能实现的方式，As one possible implementation, each target computing resource may obtain the first input tensor y by addressing, or the first input tensor y may be copied to each target computing resource; the embodiments of this application do not limit the manner in which the first input tensor y is shared.
In the embodiments of this application, a suitable operator split method can be chosen flexibly according to the axis types of the splittable axes and the position information of the splittable axes in the operator's input and output tensors, both included in the operator's split information.
Figure 17 is a schematic diagram of yet another specific application of operator splitting provided by an embodiment of this application. Scenario 3: the positions of splittable axis 1 in the first input tensor and in the first output tensor of the first operator are different.
As shown in (a) of Figure 17, taking a transpose operator as an example, the graph optimizer obtains the split information of the transpose operator and determines from it that splittable axis 1 is an element axis, that splittable axis 1 is at axis 0 of the first input tensor, and that splittable axis 1 is at axis 1 of the first output tensor, as shown in (a) of Figure 17. Based on the forward shape-derivation function of the element axis, the shape (56, 8, 56, 64) of the first output tensor can be derived from the shape (8, 56, 56, 64) of the first input tensor.
具体地，The specific split is shown in (b) of Figure 17: two target computing resources are available for the transpose operation. The first output tensor is split along axis 1 to obtain two second output tensors whose axis-1 length is 4; then, according to the reverse shape-derivation function of the element axis, the axis-0 length of each of the two second input tensors is determined to be 4, and the split function is called to split the first input tensor, whose axis-0 length is 8, along axis 0 to obtain the two second input tensors.
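The mapping of an output-side split back to the input side of a transpose can be sketched in two dimensions (a 2-D stand-in for the first two axes of the 4-D tensors in the example):

```python
def transpose01(x):
    # swaps axes 0 and 1 of a nested-list "tensor"
    return [list(row) for row in zip(*x)]

a, rest = 8, 3                        # stand-ins for axis lengths 8 and the rest
x = [[(i, j) for j in range(rest)] for i in range(a)]
full = transpose01(x)                 # the unsplit first output tensor

# the element axis is at axis 0 of the input but axis 1 of the output, so a
# split of the output along axis 1 maps back to a split of the input along
# axis 0 (reverse derivation: each half has axis-0 length 4)
x0, x1 = x[:a // 2], x[a // 2:]
y0, y1 = transpose01(x0), transpose01(x1)
stitched = [r0 + r1 for r0, r1 in zip(y0, y1)]  # concat along output axis 1
assert stitched == full
```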
在本申请实施例中，In the embodiments of this application, the graph optimizer thus only needs to know the axis types of the splittable axes of the operator's input tensors and the position information of the splittable axes in the input and output tensors. Without relying on the type of the specific operator, it can split the operator's input and output tensors appropriately, achieving thorough decoupling of operator optimization from graph optimization.
The foregoing describes the computing-task processing method of the embodiments of this application. The computing-task processing apparatus of the embodiments of this application is described below with reference to Figure 19. It should be understood that the apparatus described below can execute the method of the foregoing embodiments of this application; to avoid unnecessary repetition, repeated descriptions are appropriately omitted when the apparatus of the embodiments of this application is introduced below.
图19，Figure 19 is a schematic diagram of a computing-task processing apparatus provided by an embodiment of this application. The apparatus 1900 is applied to a graph optimizer, and the apparatus includes a processor 1901 and a transmission interface 1902. Optionally, the apparatus may further include a memory 1903 and a bus 1904.
The memory 1903, the processor 1901, and the transmission interface 1902 are communicatively connected to one another through the bus 1904.
The memory 1903 may be a ROM, a static storage device, or a RAM. The memory 1903 may store a program; when the program stored in the memory 1903 is executed by the processor 1901, the processor 1901 and the transmission interface 1902 are configured to execute the steps of the computing-task processing method of the embodiments of this application.
Exemplarily, the processor 1901 is configured to determine a first operator for performing a computing task, where the first operator includes N splittable axes, and N is a positive integer greater than or equal to 1.

The processor 1901 is configured to obtain splitting information of the first operator from an operator splitting information library, where the splitting information of the first operator includes, for the n-th splittable axis among the N splittable axes, the axis type of that axis in the first operator and position information of that axis in the first operator. The position information of the n-th splittable axis in the first operator indicates the position of the n-th splittable axis in the input tensors of the first operator, where n = 1, ..., N.

The processor 1901 is configured to split the input tensors of the first operator according to the splitting information of the first operator, to determine K groups of input tensors, where K is a positive integer greater than or equal to 2.

The transmission interface 1902 is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
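The dispatch flow performed by the processor and transmission interface can be sketched as follows. The helper name `split_for_resources`, the use of NumPy, and the element-wise example operator are illustrative assumptions, not part of this application:

```python
import numpy as np

def split_for_resources(x, axis, k):
    """Split tensor x into k near-equal chunks along `axis` (the element-axis case)."""
    return np.array_split(x, k, axis=axis)

x = np.arange(12).reshape(4, 3)
chunks = split_for_resources(x, axis=0, k=2)  # two (2, 3) chunks

# Each chunk would be sent to one target computing resource; here an element-wise
# operator (x * 2) stands in for the first operator, and the partial outputs are
# concatenated along the same axis to recover the full result.
restored = np.concatenate([c * 2 for c in chunks], axis=0)
```

Because the split axis is an element axis, the concatenation of the partial results is identical to computing the operator on the unsplit tensor.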
It should be understood that the foregoing is merely an exemplary description. The apparatus for processing a computing task is configured to perform the methods or steps mentioned in the foregoing method embodiments, and therefore corresponds to the foregoing method embodiments. For details, refer to the descriptions of the foregoing method embodiments; they are not repeated here.
The processor 1901 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the functions required by the units in the apparatus for processing a computing task of the embodiments of the present application, or to perform the method for processing a computing task of the method embodiments of the present application.

The processor 1901 may alternatively be an integrated circuit chip with signal processing capability. During implementation, the steps of the method for processing a computing task of the embodiments of the present application may be completed by an integrated logic circuit of hardware in the processor 1901 or by instructions in the form of software.

The processor 1901 may alternatively be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1903; the processor 1901 reads the information in the memory 1903 and, in combination with its hardware, completes the functions required by the units included in the apparatus for processing a computing task of the embodiments of the present application, or performs the method for processing a computing task of the method embodiments of the present application.
The transmission interface 1902 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 1900 and other devices or a communication network. For example, the image to be processed may be obtained through the transmission interface 1902.

The bus 1904 may include a path for transferring information between the components of the apparatus 1900 (for example, the memory 1903, the processor 1901, and the transmission interface 1902).

It should be noted that although the apparatus 1900 shows only a memory, a processor, and a transmission interface, in a specific implementation, those skilled in the art should understand that the apparatus 1900 may further include other components necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 1900 may further include hardware components implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 1900 may include only the components necessary for implementing the embodiments of the present application, and does not need to include all the components shown in FIG. 19.
It should be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.

It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both a volatile memory and a non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of illustration rather than limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
The foregoing embodiments may be implemented completely or partially by software, hardware, firmware, or any combination thereof. When software is used for implementation, the foregoing embodiments may be implemented completely or partially in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions according to the embodiments of the present application are generated completely or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (for example, infrared or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, that includes one or more sets of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
An embodiment of the present application provides a computer-readable storage medium configured to store a computer program; when the computer program is run on a computer, the computer is caused to perform the method for processing a computing task of the foregoing method embodiments.

An embodiment of the present application provides a computer program product. The computer program product includes computer program code; when the computer program code is run, the method for processing a computing task of the foregoing method embodiments is implemented.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate the following three cases: only A exists, both A and B exist, and only B exists, where A and B each may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects, but may also indicate an "and/or" relationship, which can be understood with reference to the context.

In the present application, "at least one" means one or more, and "multiple" means two or more. "At least one of the following" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c each may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation of the embodiments of the present application.

A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation shall not be considered beyond the scope of the present application.

A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is merely a division by logical function, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces; the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (31)
- A method for processing a computing task, wherein the method is performed by a graph optimizer, and the method comprises:
determining a first operator for performing the computing task, wherein the first operator comprises N splittable axes, and N is a positive integer greater than or equal to 1;
obtaining splitting information of the first operator from an operator splitting information library, wherein the splitting information of the first operator comprises an axis type, in the first operator, of an n-th splittable axis among the N splittable axes and first position information, wherein the first position information indicates a position of the n-th splittable axis in an input tensor of the first operator, and n = 1, ..., N;
splitting the input tensor of the first operator according to the splitting information of the first operator, to obtain K groups of input tensors, wherein K is a positive integer greater than or equal to 2; and
sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
- The method according to claim 1, wherein the axis type of a splittable axis is one of the following: an element axis, a reduction axis, and a sliding-window axis;
wherein an axis along which elements of an input tensor and an output tensor of an operator have a point-to-point mapping relationship is the element axis;
if a first axis exists in the input tensor of the operator but does not exist in the output tensor of the operator, the first axis is the reduction axis; and
an axis along which the operator performs a sliding-window scanning operation on the elements of the input tensor of the operator is the sliding-window axis.
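The three axis types of claim 2 can be illustrated with standard NumPy operations; the concrete operators (ReLU, sum-reduction, 1-D "valid" convolution) are examples chosen for illustration, not operators named in the claims:

```python
import numpy as np

x = np.arange(2 * 5, dtype=float).reshape(2, 5)

# Element axis: each axis of an element-wise op such as ReLU maps input elements
# to output elements point-to-point, so the shape is preserved.
relu_out = np.maximum(x, 0.0)

# Reduction axis: axis 1 appears in the input of a sum-reduction but is absent
# from its output.
sum_out = x.sum(axis=1)          # shape (2,)

# Sliding-window axis: a 1-D "valid" convolution scans axis 1 with a window of
# length 3, so the output length along that axis is 5 - 3 + 1 = 3.
kernel = np.ones(3)
conv_out = np.stack([np.convolve(row, kernel, mode="valid") for row in x])
```

The classification depends only on how an axis appears in the input and output shapes, which is exactly the information stored in the operator splitting information library.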
- The method according to claim 2, wherein splitting the input tensor of the first operator according to the splitting information of the first operator, to obtain the K groups of input tensors, comprises:
determining a target splitting axis, wherein the target splitting axis is one of the N splittable axes;
determining, according to the splitting information of the first operator, a splitting manner corresponding to the axis type of the target splitting axis in the first operator; and
splitting the input tensor of the first operator according to the splitting manner corresponding to the axis type of the target splitting axis in the first operator, to obtain the K groups of input tensors.
- The method according to claim 3, wherein splitting the input tensor of the first operator according to the splitting manner corresponding to the axis type of the target splitting axis in the first operator, to obtain the K groups of input tensors, comprises:
determining, according to the splitting manner, Q first input tensors of the first operator that include the target splitting axis and a position of the target splitting axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the quantity K of the target computing resources, to obtain Q groups of second input tensors, wherein each of the Q groups of second input tensors includes K second input tensors; and
obtaining the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
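The grouping step of claim 4 can be sketched as follows: only the Q input tensors that contain the target splitting axis are split, and every unsplit input is replicated into all K groups. The function name `split_inputs` and the dictionary-based interface are assumptions made for this sketch:

```python
import numpy as np

def split_inputs(inputs, axis_positions, k):
    """inputs: dict name -> tensor.  axis_positions: name -> dimension index of
    the target splitting axis; only the Q tensors that contain the axis appear
    here.  Returns K groups of input tensors."""
    groups = [dict() for _ in range(k)]
    for name, tensor in inputs.items():
        if name in axis_positions:          # one of the Q first input tensors
            pieces = np.array_split(tensor, k, axis=axis_positions[name])
            for g, p in zip(groups, pieces):
                g[name] = p
        else:                               # unsplit input: replicated to all K groups
            for g in groups:
                g[name] = tensor
    return groups

# Example: splitting the m axis of a MatMul (a: (4, 2), b: (2, 3)); only `a`
# contains the m axis, so `b` is replicated.
groups = split_inputs({"a": np.arange(8.0).reshape(4, 2), "b": np.ones((2, 3))},
                      {"a": 0}, k=2)
```

Each of the resulting K groups is a complete argument set for the first operator on one target computing resource.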
- The method according to claim 2, wherein, in a case in which the operators for performing the computing task further include a second operator, the second operator includes P splittable axes, and the P splittable axes are a subset of the N splittable axes;
splitting the input tensor of the first operator according to the splitting information of the first operator, to obtain the K groups of input tensors, comprises:
obtaining splitting information of the second operator from the operator splitting information library, wherein the splitting information of the second operator comprises an axis type, in the second operator, of a p-th splittable axis among the P splittable axes and second position information, wherein the second position information indicates a position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, ..., P;
determining P pieces of splitting reference information according to the splitting information of the first operator and the splitting information of the second operator, wherein the p-th piece of splitting reference information among the P pieces comprises: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator;
determining P groups of candidate splitting manners according to the P pieces of splitting reference information, wherein the p-th group of candidate splitting manners among the P groups includes at least one splitting manner;
determining a target splitting manner according to the time required by each splitting manner in the P groups of candidate splitting manners to complete the computing task; and
splitting the input tensor of the first operator according to the target splitting manner, to obtain the K groups of input tensors.
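The time-based selection among candidate splitting manners can be sketched with a cost-model comparison. The function `pick_target_split` and the toy cost table are hypothetical; the claim only requires that the candidate with the shortest estimated completion time be chosen:

```python
def pick_target_split(candidates, estimate_time):
    """Return the candidate splitting manner with the lowest estimated
    completion time for the computing task."""
    return min(candidates, key=estimate_time)

# Candidates: (axis name, axis type) pairs shared by the first and second operators.
candidates = [("m", "element"), ("k", "reduction"), ("n", "element")]

# Toy cost model (an assumption): reduction-axis splits pay an extra cost for
# aggregating partial results across the K resources.
cost = {"element": 1.0, "reduction": 1.5}
best = pick_target_split(candidates, lambda c: cost[c[1]])
```

Any cost model can be plugged in; the structure of the selection step is independent of how completion time is estimated.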
- The method according to claim 5, wherein splitting the input tensor of the first operator according to the target splitting manner, to obtain the K groups of input tensors, comprises:
determining, according to the target splitting manner, a target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, Q first input tensors of the first operator that include the target splitting axis, and a position of the target splitting axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the quantity K of the target computing resources, to obtain Q groups of second input tensors,
wherein each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors among the Q groups is a result of splitting the q-th first input tensor among the Q first input tensors into K pieces, wherein q = 1, ..., Q; and
determining the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
- The method according to claim 4, wherein, when the axis type of the target splitting axis in the first operator is the element axis or the sliding-window axis, the first position information of the target splitting axis further indicates a position of the target splitting axis in an output tensor of the first operator, and splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the quantity K of the target computing resources, to obtain the Q groups of second input tensors, comprises:
determining, according to the first position information of the target splitting axis, L first output tensors of the first operator that include the target splitting axis and a position of the target splitting axis in each of the L first output tensors, wherein L is a positive integer greater than or equal to 1;
using a first input length as an input of a forward shape derivation function of the target splitting axis, to obtain a first output length, wherein the first input length is the length of the target splitting axis in each of the first input tensors, and the lengths of the target splitting axis in the first input tensors are equal;
splitting the L first output tensors along the target splitting axis according to the first output length and the quantity K of the target computing resources, to obtain L groups of second output tensors, wherein each of the L groups of second output tensors includes K second output tensors;
using, as inputs of an inverse derivation function of the target splitting axis, the K second output lengths corresponding to the target splitting axis in each of the L groups of second output tensors, to obtain K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors; and
splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
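For a sliding-window axis, the forward and inverse shape-derivation functions of claim 7 can be sketched for the simple case of window length w and stride 1. The function names and the stride-1 assumption are illustrative; the claim does not fix a particular window formula:

```python
def forward_len(in_len, w):
    """Forward shape derivation: input length -> output length (stride 1)."""
    return in_len - w + 1

def inverse_len(out_len, w):
    """Inverse derivation: output length -> required input slice length."""
    return out_len + w - 1

in_len, w, k = 10, 3, 2
out_len = forward_len(in_len, w)                    # first output length

# Split the output evenly across the K target resources, then derive each
# required input slice length; consecutive slices overlap by w - 1 elements.
out_parts = [out_len // k] * k
in_parts = [inverse_len(p, w) for p in out_parts]
```

For an element axis the same procedure applies with the identity as both forward and inverse function, so the input split lengths equal the output split lengths.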
- The method according to claim 7, wherein, when the axis type of the target splitting axis in the first operator is the element axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors, comprises:
splitting, by scheduling a first splitting function, each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
- The method according to claim 7, wherein, when the axis type of the target splitting axis in the first operator is the sliding-window axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors, comprises:
splitting with overlap, by scheduling a first slice function, each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
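The overlapped slicing of claim 9 can be sketched directly: each slice's length comes from the inverse shape derivation, and consecutive slices overlap by w - 1 elements (stride 1 assumed). The helper name `overlapped_slices` and the 1-D convolution used to check the result are assumptions of this sketch:

```python
import numpy as np

def overlapped_slices(x, axis, out_parts, w):
    """Slice x along `axis` into pieces whose lengths are given by the inverse
    shape derivation; consecutive pieces overlap by w - 1 elements."""
    pieces, start = [], 0
    for p in out_parts:                      # p = output length of this piece
        in_len = p + w - 1                   # inverse derivation for the window axis
        idx = [slice(None)] * x.ndim
        idx[axis] = slice(start, start + in_len)
        pieces.append(x[tuple(idx)])
        start += p                           # advance by the output length only
    return pieces

x = np.arange(10.0)
parts = overlapped_slices(x, axis=0, out_parts=[4, 4], w=3)

# Convolving each overlapped slice independently reproduces the unsplit result.
kernel = np.ones(3)
merged = np.concatenate([np.convolve(p, kernel, mode="valid") for p in parts])
```

Because the slices carry the overlap, no data exchange between the K resources is needed during the computation itself.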
- The method according to claim 7, wherein, when the axis type of the target splitting axis in the first operator is the sliding window axis, the splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain the Q groups of second input tensors, comprises: splitting, by invoking a second split function, each of the Q first input tensors along the target splitting axis to obtain Q groups of third input tensors, each group of the Q groups of third input tensors comprising K third input tensors; slicing, by invoking a second slice function, the K third input tensors in each group of the Q groups of third input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and concatenating, by invoking a concatenation function, the k-th fourth input tensor in the q-th group of the Q groups of fourth input tensors with the k-th third input tensor in the q-th group of the Q groups of third input tensors along the target splitting axis, to obtain the Q groups of second input tensors.
- The method according to claim 4, wherein, when the axis type of the target splitting axis in the first operator is the reduction axis, the splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, comprises: splitting, by calling a third split function, each of the Q first input tensors according to the number K of the target computing resources, to obtain the Q groups of second input tensors.
- The method according to claim 11, wherein the reduction axis comprises a first-type reduction axis and a second-type reduction axis, wherein the first-type reduction axis is a reduction axis along which the operator performs a reduce operation on elements of the operator's input tensor, and the second-type reduction axis is a reduction axis along which the operator does not perform a reduce operation on elements of the operator's input tensor.
- The method according to claim 12, wherein the first-type reduction axis comprises any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, or a reduce-mean axis; wherein the reduce-sum axis is a reduction axis along which the operator performs a sum reduction on elements of the operator's input tensor; the reduce-max axis is a reduction axis along which the operator performs a maximum reduction on elements of the operator's input tensor; the reduce-min axis is a reduction axis along which the operator performs a minimum reduction on elements of the operator's input tensor; and the reduce-mean axis is a reduction axis along which the operator performs an average reduction on elements of the operator's input tensor.
- The method according to claim 12, wherein the second-type reduction axis comprises a reduce-gather axis, the reduce-gather axis being an axis along which the operator indexes data from elements of the operator's input tensor according to addresses indicated by elements of the operator's index input tensor.
- The method according to any one of claims 1 to 14, wherein the target computing resource comprises one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
- A device for processing a computing task, wherein the device is applied to a graph optimizer and comprises a processor and a transmission interface, wherein: the processor is configured to determine a first operator for performing a computing task, the first operator comprising N splittable axes, N being a positive integer greater than or equal to 1; the processor is configured to obtain splitting information of the first operator from an operator splitting information library, the splitting information of the first operator comprising an axis type, in the first operator, of the n-th splittable axis among the N splittable axes and first position information, wherein the first position information indicates a position of the n-th splittable axis in an input tensor of the first operator, n = 1, ..., N; the processor is configured to split the input tensor of the first operator according to the splitting information of the first operator, to obtain K groups of input tensors, K being a positive integer greater than or equal to 2; and the transmission interface is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
- The device according to claim 16, wherein the axis type of a splittable axis is one of the following: an element axis, a reduction axis, or a sliding window axis; wherein an axis along which the elements of an operator's input tensor and output tensor have a point-to-point mapping relationship is the element axis; if a first axis exists in the operator's input tensor but not in the operator's output tensor, the first axis is the reduction axis; and an axis along which the operator performs a sliding window scan over the elements of the operator's input tensor is the sliding window axis.
- The device according to claim 17, wherein, in splitting the input tensor of the first operator according to the splitting information of the first operator to obtain the K groups of input tensors, the processor is configured to: determine a target splitting axis, the target splitting axis being one of the N splittable axes; determine, according to the splitting information of the first operator, a splitting method corresponding to the axis type of the target splitting axis in the first operator; and split the input tensor of the first operator according to the splitting method corresponding to the axis type of the target splitting axis in the first operator, to obtain the K groups of input tensors.
- The device according to claim 18, wherein the processor is specifically configured to: determine, according to the splitting method corresponding to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the first operator, Q first input tensors of the first operator that contain the target splitting axis, and the position of the target splitting axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number K of the target computing resources, to obtain Q groups of second input tensors, each group of the Q groups of second input tensors comprising K second input tensors; and obtain the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
- The device according to claim 17, wherein, in a case where the operators for performing the computing task further comprise a second operator, the second operator comprises P splittable axes, the P splittable axes being a subset of the N splittable axes, and the processor is specifically configured to: obtain splitting information of the second operator from the operator splitting information library, the splitting information of the second operator comprising an axis type, in the second operator, of the p-th splittable axis among the P splittable axes and second position information, wherein the second position information indicates a position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator being an output tensor of the first operator, P being a positive integer greater than or equal to 1 and less than or equal to N, p = 1, ..., P; determine P pieces of splitting reference information according to the splitting information of the first operator and the splitting information of the second operator, the p-th piece of the P pieces of splitting reference information comprising: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator; determine P groups of candidate splitting methods according to the P pieces of splitting reference information, wherein the p-th group of the P groups of candidate splitting methods comprises at least one splitting method; determine a target splitting method according to the time required by each splitting method in the P groups of candidate splitting methods to complete the computing task; and split the input tensor of the first operator according to the target splitting method, to obtain the K groups of input tensors.
- The device according to claim 20, wherein the processor is specifically configured to: determine, according to the target splitting method, the target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, Q first input tensors of the first operator that contain the target splitting axis, and the position of the target splitting axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number K of the target computing resources, to obtain Q groups of second input tensors, wherein each group of the Q groups of second input tensors comprises K second input tensors, and the q-th group of the Q groups of second input tensors is the result of splitting the q-th of the Q first input tensors into K parts, q = 1, ..., Q; and obtain the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
- The device according to claim 19, wherein, when the axis type of the target splitting axis in the first operator is the element axis or the sliding window axis, the first position information of the target splitting axis further indicates the position of the target splitting axis in an output tensor of the first operator, and the processor is specifically configured to: determine, according to the first position information of the target splitting axis, L first output tensors of the first operator that contain the target splitting axis, and the position of the target splitting axis in each of the L first output tensors, L being a positive integer greater than or equal to 1; take a first input length as the input of a forward shape derivation function of the target splitting axis to obtain a first output length, the first input length being the length of the target splitting axis in each of the first input tensors, wherein the length of the target splitting axis is equal in each of the first input tensors; split the L first output tensors along the target splitting axis according to the first output length and the number K of the target computing resources, to obtain L groups of second output tensors, each group of the L groups of second output tensors comprising K second output tensors, the l-th group of the L groups of second output tensors being the result of splitting the l-th of the L first output tensors into K parts; take the K second output lengths corresponding to the target splitting axis in each group of the L groups of second output tensors as the input of a reverse derivation function of the target splitting axis, to obtain K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, wherein the lengths corresponding to the target splitting axis in the k-th second output tensor of each group of the L groups of second output tensors are equal, and the lengths corresponding to the target splitting axis in the k-th second input tensor of each group of the Q groups of second input tensors are equal; and split each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
- The device according to claim 22, wherein, when the axis type of the target splitting axis in the first operator is the element axis, the processor is specifically configured to: split, by invoking a first split function, each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
- The device according to claim 22, wherein, when the axis type of the target splitting axis in the first operator is the sliding window axis, the processor is specifically configured to: split with overlap, by invoking a first slice function, each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
- The device according to claim 22, wherein, when the axis type of the target splitting axis in the first operator is the sliding window axis, the processor is specifically configured to: split, by invoking a second split function, each of the Q first input tensors along the target splitting axis to obtain Q groups of third input tensors, each group of the Q groups of third input tensors comprising K third input tensors; slice, by invoking a second slice function, the K third input tensors in each group of the Q groups of third input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and concatenate, by invoking a concatenation function, the k-th fourth input tensor in the q-th group of the Q groups of fourth input tensors with the k-th third input tensor in the q-th group of the Q groups of third input tensors along the target splitting axis, to obtain the Q groups of second input tensors.
- The device according to claim 19, wherein, when the axis type of the target splitting axis in the first operator is the reduction axis, the processor is specifically configured to: split, by calling a third split function, each of the Q first input tensors according to the number K of the target computing resources, to obtain Q groups of second input tensors.
- The device according to claim 26, wherein the reduction axis comprises a first-type reduction axis and a second-type reduction axis, wherein the first-type reduction axis is a reduction axis along which the operator performs a reduce operation on elements of the operator's input tensor, and the second-type reduction axis is a reduction axis along which the operator does not perform a reduce operation on elements of the operator's input tensor.
- The device according to claim 27, wherein the first-type reduction axis comprises any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, or a reduce-mean axis; wherein the reduce-sum axis is a reduction axis along which the operator performs a sum reduction on elements of the operator's input tensor; the reduce-max axis is a reduction axis along which the operator performs a maximum reduction on elements of the operator's input tensor; the reduce-min axis is a reduction axis along which the operator performs a minimum reduction on elements of the operator's input tensor; and the reduce-mean axis is a reduction axis along which the operator performs an average reduction on elements of the operator's input tensor.
- The device according to claim 27, wherein the second-type reduction axis comprises a reduce-gather axis, the reduce-gather axis being an axis along which the operator indexes data from elements of the operator's input tensor according to addresses indicated by elements of the operator's index input tensor.
- The device according to any one of claims 16 to 29, wherein the target computing resource comprises one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
- A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and the program code comprises instructions for performing the method according to any one of claims 1 to 15.
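The claims name a "first split function" for element-axis splitting but leave its implementation open. The following is a minimal NumPy sketch of splitting a tensor along an element axis into per-resource pieces of given lengths; the function name and the use of `np.split` are illustrative assumptions, not the patent's definitions:

```python
import numpy as np

def split_element_axis(x, axis, lengths):
    """Illustrative 'first split function': split tensor x along an
    element axis into parts whose lengths are given per target
    computing resource (the K second input lengths)."""
    assert sum(lengths) == x.shape[axis]
    # np.split takes the cumulative offsets of the cut points.
    offsets = np.cumsum(lengths)[:-1]
    return np.split(x, offsets, axis=axis)

x = np.arange(24).reshape(4, 6)
parts = split_element_axis(x, axis=1, lengths=[2, 2, 2])  # K = 3
print([p.shape for p in parts])  # → [(4, 2), (4, 2), (4, 2)]
```

Because an element axis has a point-to-point input/output mapping, concatenating the per-resource outputs along the same axis reproduces the unsplit result.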
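For a sliding window axis, the claims describe splitting with overlap, guided by a forward shape derivation function and a reverse derivation function. A hedged NumPy sketch of one plausible realization, assuming a convolution-style shape rule `out = (in - win) // stride + 1` (the helper names are invented for illustration):

```python
import numpy as np

def forward_shape(in_len, win, stride):
    # Forward shape derivation: output length of a sliding-window op.
    return (in_len - win) // stride + 1

def reverse_shape(out_len, win, stride):
    # Reverse derivation: input length needed to produce out_len windows.
    return (out_len - 1) * stride + win

def split_sliding_window_axis(x, axis, K, win, stride):
    """Split x along a sliding-window axis into K overlapping slices,
    so each target resource can compute its share of windows locally."""
    out_len = forward_shape(x.shape[axis], win, stride)
    # First split the *output* axis into K near-equal parts.
    base, rem = divmod(out_len, K)
    out_lens = [base + (1 if k < rem else 0) for k in range(K)]
    slices, start = [], 0
    for ol in out_lens:
        in_len = reverse_shape(ol, win, stride)  # overlapping input span
        idx = [slice(None)] * x.ndim
        idx[axis] = slice(start, start + in_len)
        slices.append(x[tuple(idx)])
        start += ol * stride  # next slice begins ol * stride further in
    return slices

x = np.arange(10)
parts = split_sliding_window_axis(x, axis=0, K=2, win=3, stride=1)
print([p.shape for p in parts])  # → [(6,), (6,)]
```

With `win=3, stride=1` the full input yields 8 windows; each half produces 4, and the two input slices overlap by `win - stride = 2` elements, matching the "split with overlap" language of the claims.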
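When the target splitting axis is a reduction axis (e.g. a reduce-sum axis), each resource can reduce its own part and the partial results are then combined. A minimal NumPy sketch under that assumption; the "third split function" is stood in for by `np.array_split`, and the cross-resource combination is modeled as a plain host-side sum rather than a real collective:

```python
import numpy as np

def split_reduce_sum(x, axis, K):
    """Split x along a reduce-sum axis into K parts (illustrative
    'third split function'), reduce each part locally, then combine
    the K partial sums (standing in for a cross-resource reduce)."""
    parts = np.array_split(x, K, axis=axis)        # per-resource shards
    partials = [p.sum(axis=axis) for p in parts]   # local reduction
    return np.sum(partials, axis=0)                # combine partials

x = np.arange(24, dtype=np.float64).reshape(4, 6)
assert np.allclose(split_reduce_sum(x, axis=1, K=3), x.sum(axis=1))
```

The same two-stage pattern applies to reduce-max and reduce-min (combine with elementwise max/min); reduce-mean additionally needs the per-part element counts to weight the partials correctly.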
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/074576 WO2023141939A1 (en) | 2022-01-28 | 2022-01-28 | Method and device for processing computing task |
CN202280012811.6A CN116888601A (en) | 2022-01-28 | 2022-01-28 | Method and device for processing computing task |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2023141939A1 true WO2023141939A1 (en) | 2023-08-03 |
WO2023141939A9 WO2023141939A9 (en) | 2024-07-25 |
Family
ID=87469960
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116888601A (en) |
WO (1) | WO2023141939A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118312327A (en) * | 2024-06-06 | 2024-07-09 | 北京壁仞科技开发有限公司 | Hardware resource allocation method, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112465108A (en) * | 2020-11-11 | 2021-03-09 | 上海交通大学 | Neural network compiling method for storage and calculation integrated platform |
US10970619B1 (en) * | 2020-08-21 | 2021-04-06 | Moffett Technologies Co., Limited | Method and system for hierarchical weight-sparse convolution processing |
CN113449857A (en) * | 2020-03-27 | 2021-09-28 | 华为技术有限公司 | Data processing method and data processing equipment |
WO2021190761A1 (en) * | 2020-03-27 | 2021-09-30 | Huawei Technologies Co., Ltd. | Parallel computing scheme generation for neural networks |
CN113485837A (en) * | 2021-07-21 | 2021-10-08 | 瀚博半导体(上海)有限公司 | Tensor processing method and processing system based on parallel branch and tensor segmentation |
Also Published As
Publication number | Publication date |
---|---|
WO2023141939A9 (en) | 2024-07-25 |
CN116888601A (en) | 2023-10-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 202280012811.6 Country of ref document: CN |
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22922783 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |