WO2023141939A1 - Method and device for processing computing task - Google Patents

Method and device for processing computing task

Publication number
WO2023141939A1
Authority
WO
WIPO (PCT)
Prior art keywords
axis
input
operator
tensors
tensor
Prior art date
Application number
PCT/CN2022/074576
Other languages
French (fr)
Chinese (zh)
Other versions
WO2023141939A9 (en
Inventor
柯继伟
俞郑中
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2022/074576 priority Critical patent/WO2023141939A1/en
Priority to CN202280012811.6A priority patent/CN116888601A/en
Publication of WO2023141939A1 publication Critical patent/WO2023141939A1/en
Publication of WO2023141939A9 publication Critical patent/WO2023141939A9/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology

Definitions

  • The embodiments of the present application relate to the field of artificial intelligence and, more specifically, to a method and device for processing computing tasks.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. Artificial intelligence enables machines to have the functions of perception, reasoning and decision-making by studying the design principles and implementation methods of various intelligent machines. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • Open-source deep learning frameworks such as TensorFlow, PyTorch, and MXNet provide users with a friendly programming environment for deep learning models, allowing them to easily deploy a designed model on general-purpose computer hardware platforms such as central processing units (CPUs) and graphics processing units (GPUs).
  • The forward inference framework of the device manufacturer is generally used; for example, TensorRT is used on NVIDIA GPUs.
  • A deep learning compiler can be used to generate efficient code, for different types of devices, from a model described in a deep learning framework.
  • Deep learning compilers usually improve the running performance of models on different hardware through graph optimization and operator optimization. These two optimizations are usually relatively decoupled and independent of each other.
  • However, graph optimization often has to rely on the principles of each operator to obtain a suitable parallel strategy for operator optimization. How a graph optimizer can split operators automatically without relying on the principles of specific operators is therefore an urgent problem.
  • The embodiments of the present application provide a method and device for processing computing tasks, so that the graph optimizer can automatically split the input and output tensors of operators without relying on the principles of specific operators. This completely decouples the graph optimizer from the operator optimization module and enables the operators corresponding to a computing task to be computed in parallel on multiple computing resources.
  • Saying that the first operator includes N splittable axes means that the input tensor of the first operator includes N splittable axes.
  • The computing tasks may be computing tasks in the field of artificial intelligence, such as image processing, video processing, speech processing, or natural language processing tasks. They may also be computing tasks in the field of big data processing or in the field of high-performance computing (HPC); this is not limited in this application.
  • the input tensor of the first operator corresponding to the computing task can be the input tensor corresponding to the computing task in any of the above fields.
  • For example, if the computing task is an image processing task, the input tensor of the first operator represents image-related data.
  • The graph optimizer obtains operator splitting information from the operator splitting information library. Because the splitting information of each operator can be obtained directly from this library, the graph optimizer does not need to be aware of the mathematical semantics or underlying implementation of any operator and can split the operator's input tensor automatically. This completely decouples graph optimization from operator optimization, so that the operators corresponding to a computing task are computed in parallel on multiple computing resources.
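  • As a minimal sketch of the lookup described above, the operator splitting information library can be pictured as a table keyed by operator type. The names `AxisInfo`, `SPLIT_INFO_DB`, and `lookup_split_info`, and the two entries shown, are hypothetical illustrations and not the format used in the application.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AxisInfo:
    """Hypothetical record describing one splittable axis of an operator."""
    name: str
    axis_type: str  # "element", "reduction" or "sliding_window"
    # (input tensor index, dimension index) pairs where the axis appears
    input_positions: Tuple[Tuple[int, int], ...]


# Hypothetical library contents: operator type -> its splittable axes.
SPLIT_INFO_DB: Dict[str, List[AxisInfo]] = {
    "relu": [AxisInfo("n", "element", ((0, 0),))],
    "reduce_sum_dim1": [
        AxisInfo("n", "element", ((0, 0),)),
        AxisInfo("c", "reduction", ((0, 1),)),
    ],
}


def lookup_split_info(op_type: str) -> List[AxisInfo]:
    """Return the splittable axes of an operator without inspecting its math."""
    return SPLIT_INFO_DB[op_type]


for axis in lookup_split_info("reduce_sum_dim1"):
    print(axis.name, axis.axis_type, axis.input_positions)
```

  • In this sketch the caller only reads axis types and positions, which is the property the text relies on: nothing about the operator's formula is needed to decide how to split.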
  • The axis type of a splittable axis is one of the following: element axis, reduction axis, or sliding window axis. An axis along which the elements of the operator's input tensor and output tensor have a point-to-point mapping is an element axis. If a first axis is present in the operator's input tensor but absent from its output tensor, the first axis is a reduction axis. An axis along which the operator performs a sliding-window scan over the elements of its input tensor is a sliding window axis.
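  • The three axis types can be illustrated with three toy operators on a small 3 x 4 tensor. This is only an illustration written with nested Python lists; the operators chosen (a clamp, a row sum, and a width-2 moving sum) are assumptions made for the example.

```python
x = [[0.0, 1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0, 7.0],
     [8.0, 9.0, 10.0, 11.0]]

# Element axis: every output element maps to exactly one input element,
# so both dimensions of x are element axes for this operator (shape 3 x 4).
y_elem = [[max(v, 5.0) for v in row] for row in x]

# Reduction axis: dimension 1 exists in the input but not in the output
# (shape 3).
y_red = [sum(row) for row in x]

# Sliding window axis: a window of width 2 scans along dimension 1
# (shape 3 x 3).
y_win = [[row[i] + row[i + 1] for i in range(len(row) - 1)] for row in x]

print(len(y_elem[0]), len(y_red), len(y_win[0]))
```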
  • The target splitting axis is determined, the target splitting axis being one of the N splittable axes. According to the splitting information of the first operator, the splitting method corresponding to the axis type of the target splitting axis in the first operator is determined. According to that splitting method, the input tensor of the first operator is split to obtain K groups of input tensors.
  • By splitting a single operator according to the types of the different axes in the operator's input tensor and the splitting method corresponding to each axis type, the graph optimizer can automatically obtain different single-operator splitting strategies without relying on the principles of specific operators, which completely decouples the graph optimizer from the operator optimization module.
  • Splitting the input tensor of the first operator according to the splitting method corresponding to the axis type of the target splitting axis in the first operator, to obtain K groups of input tensors, includes: determining, according to the splitting method, the Q first input tensors in the first operator that include the target splitting axis and the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator and the number K of target computing resources, to obtain Q groups of second input tensors; and obtaining the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
  • Each group of second input tensors among the Q groups includes K second input tensors, and the q-th group is the result of splitting the q-th first input tensor of the Q first input tensors into K pieces. The k-th group of input tensors among the K groups includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
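  • The grouping described above can be sketched as follows. The helper names `split_rows` and `build_input_groups` are hypothetical, tensors are modelled as lists of rows, and the sketch assumes for simplicity that the target splitting axis sits at dimension 0 of every first input tensor and is cut into near-equal pieces.

```python
def split_rows(tensor, k):
    """Cut a list of rows into k consecutive pieces of near-equal length."""
    base, rem = divmod(len(tensor), k)
    pieces, start = [], 0
    for i in range(k):
        stop = start + base + (1 if i < rem else 0)
        pieces.append(tensor[start:stop])
        start = stop
    return pieces


def build_input_groups(inputs, split_indices, k):
    """Return K groups; group i holds the i-th piece of every split input
    and every unsplit input unchanged."""
    pieces = {q: split_rows(inputs[q], k) for q in split_indices}
    return [
        [pieces[q][i] if q in pieces else tensor
         for q, tensor in enumerate(inputs)]
        for i in range(k)
    ]


a = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]  # contains the target axis
b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # contains the target axis
bias = [10.0, 20.0]                                   # does not contain it

groups = build_input_groups([a, b, bias], split_indices=[0, 1], k=2)
print(len(groups), [len(t) for t in groups[0]])
```

  • Here Q = 2 and K = 2: each of the two groups receives one piece of `a`, one piece of `b`, and the whole of `bias`.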
  • The second operator includes P splittable axes, which are a subset of the N splittable axes. Splitting the input tensor of the first operator according to the splitting information of the first operator, to obtain K groups of input tensors, includes: obtaining the splitting information of the second operator from the operator splitting information library.
  • the segmentation methods included in the p-th group of candidate segmentation methods are determined according to the p-th segmentation reference information among the P segmentation reference information and the amount M of computing resources.
  • the graph optimizer automatically splits the operator input and output tensors according to different types of axes.
  • it is not necessary to split the input and output tensors based on the principle of specific operators. It only needs to split the input and output tensors based on the operator splitting methods corresponding to different types of axes.
  • Splitting the input and output tensors of an operator does not change the operator's calculation formula; only some of the operator's parameters change. This completely decouples graph optimization from the principles of specific operators, and because the method is based on axis types, the method of splitting the operator's first input tensor generalizes better.
  • Using the position information of the splitting axis on the operator's input and output tensors, an appropriate operator splitting method can be selected flexibly.
  • Splitting the input tensor of the first operator to obtain K groups of input tensors includes: determining, according to the target splitting method, the target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, the Q first input tensors in the first operator that include the target splitting axis, and the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; and splitting each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number K of target computing resources, to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors.
  • the kth group of input tensors in the K group of input tensors includes the kth second input tensor in each group of second input tensors in the Q group of second input tensors and the undivided input tensor of the first operator.
  • Splitting each of the Q first input tensors to obtain Q groups of second input tensors includes the following when the axis type of the target splitting axis in the first operator is an element axis or a sliding window axis and its axis type in the second operator is an element axis or a sliding window axis: determining, according to the first position information and the second position information of the target splitting axis, the L first output tensors that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; and using the first input length as the input of the forward shape derivation function corresponding to the axis type of the target splitting axis.
  • The target splitting axis has the same length in the k-th second output tensor of every group among the L groups of second output tensors, and the same length in the k-th tensor of every group among the Q groups of fifth input tensors.
  • The K third input lengths corresponding to the target splitting axis in each group of fifth input tensors are each used as the input of the reverse derivation function corresponding to the axis type of the target splitting axis in the first operator, to obtain the K second input lengths corresponding to the target splitting axis in each group of second input tensors among the Q groups of second input tensors.
  • Consecutive operators operate on the split input tensors on the same target computing resource, which enables parallel computation across multiple target computing resources.
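  • A minimal sketch of this idea, assuming two element-wise operators and using a thread pool as a stand-in for K = 2 target computing resources. The operators `first_op` and `second_op` and the helper `run_chain` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def first_op(rows):
    """Hypothetical first operator: element-wise ReLU."""
    return [[max(v, 0.0) for v in row] for row in rows]


def second_op(rows):
    """Hypothetical second operator: element-wise scaling by 2."""
    return [[2.0 * v for v in row] for row in rows]


def run_chain(piece):
    """Run both operators back to back on one piece, on one resource."""
    return second_op(first_op(piece))


x = [[-3.0, 1.0], [2.0, -4.0], [5.0, -6.0], [-7.0, 8.0]]
pieces = [x[:2], x[2:]]  # split along dimension 0 for K = 2 resources

with ThreadPoolExecutor(max_workers=2) as pool:
    outputs = list(pool.map(run_chain, pieces))

merged = [row for out in outputs for row in out]
print(merged == run_chain(x))
```

  • Each piece passes through both operators without returning to a central point in between, so no intermediate result has to be exchanged between resources.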
  • the first position information of the target segmentation axis is also used to indicate that the target segmentation axis is in
  • Splitting each of the Q first input tensors to obtain Q groups of second input tensors includes: determining, according to the first position information of the target splitting axis, the L first output tensors in the first operator that include the target splitting axis and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; and using the first input length as the input of the forward shape derivation function of the target splitting axis to obtain the first output length.
  • The l-th group of second output tensors among the L groups of second output tensors is the result of splitting the l-th first output tensor of the L first output tensors into K pieces.
  • The target splitting axis has the same length in the k-th second output tensor of every group among the L groups of second output tensors, and the same length in the k-th second input tensor of every group among the Q groups of second input tensors.
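  • The length bookkeeping above can be sketched for a single operator. The formulas below are the usual output-length relation for a window of size `kernel` and step `stride` without padding, which is an assumption for this example; the application does not give the derivation functions. With `kernel=1, stride=1` the same functions describe an element axis.

```python
def forward_len(in_len, kernel=1, stride=1):
    """Assumed forward shape derivation: input length -> output length."""
    return (in_len - kernel) // stride + 1


def reverse_len(out_len, kernel=1, stride=1):
    """Assumed reverse derivation: output length -> required input length."""
    return (out_len - 1) * stride + kernel


def split_lengths(total, k):
    """Divide a length into k near-equal parts."""
    base, rem = divmod(total, k)
    return [base + (1 if i < rem else 0) for i in range(k)]


K = 2
# Sliding window axis: first input length 10, window 3, stride 1.
out_len = forward_len(10, kernel=3)                        # first output length
out_parts = split_lengths(out_len, K)                      # K second output lengths
in_parts = [reverse_len(n, kernel=3) for n in out_parts]   # K second input lengths
print(out_len, out_parts, in_parts)

# Element axis: input and output lengths are identical.
elem_parts = [reverse_len(n) for n in split_lengths(forward_len(10), K)]
print(elem_parts)
```

  • For the sliding window case the K second input lengths add up to more than the first input length, which is why neighbouring pieces must share some elements.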
  • When the axis type of the target splitting axis in the first operator is an element axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of second input tensors, to obtain Q groups of second input tensors, includes: calling a first splitting function to split each of the Q first input tensors along the target splitting axis according to those K second input lengths, to obtain the Q groups of second input tensors.
  • The elements corresponding to the target splitting axis in the q-th group of second input tensors are a subset of the elements corresponding to the target splitting axis in the q-th first input tensor. The elements corresponding to the target splitting axis in the individual second input tensors of the q-th group do not intersect, and their union is the set of elements corresponding to the target splitting axis in the q-th first input tensor.
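  • A minimal sketch of an element-axis split, assuming tensors are lists of rows and the target splitting axis is dimension 0. `split_by_lengths` is a hypothetical stand-in for the first splitting function, and ReLU stands in for the operator.

```python
def split_by_lengths(rows, lengths):
    """Cut rows into consecutive, non-overlapping pieces of the given lengths."""
    pieces, start = [], 0
    for n in lengths:
        pieces.append(rows[start:start + n])
        start += n
    return pieces


def relu(rows):
    """Hypothetical element-wise operator."""
    return [[max(v, 0.0) for v in row] for row in rows]


x = [[-1.0, 2.0], [3.0, -4.0], [-5.0, 6.0], [7.0, -8.0], [9.0, -10.0]]
parts = split_by_lengths(x, [3, 2])  # K = 2 second input lengths

# The pieces are disjoint and together cover x, so concatenating the
# per-piece results reproduces the unsplit result.
merged = [row for p in parts for row in relu(p)]
print(merged == relu(x))
```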
  • When the axis type of the target splitting axis in the first operator is a sliding window axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths, to obtain Q groups of second input tensors, includes: calling a first slicing function to slice each of the Q first input tensors along the target splitting axis with overlap, according to the K second input lengths corresponding to the target splitting axis in each group of second input tensors, to obtain the Q groups of second input tensors.
  • The elements corresponding to the target splitting axis in the q-th group of second input tensors are a subset of the elements corresponding to the target splitting axis in the q-th first input tensor. The elements corresponding to the target splitting axis in the individual second input tensors of the q-th group intersect, and their union is the set of elements corresponding to the target splitting axis in the q-th first input tensor.
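  • An overlapping split of a sliding window axis can be sketched on a 1-D tensor. `moving_sum` is a hypothetical stand-in operator (window 3, stride 1, no padding), and `split_with_overlap` is a hypothetical stand-in for the first slicing function.

```python
def moving_sum(x, kernel):
    """Stand-in sliding-window operator: stride 1, no padding."""
    return [sum(x[i:i + kernel]) for i in range(len(x) - kernel + 1)]


def split_with_overlap(x, out_lengths, kernel):
    """Give each resource the input slice it needs to produce its own
    share of the output; neighbouring slices share kernel - 1 elements."""
    pieces, out_start = [], 0
    for n in out_lengths:
        in_len = n + kernel - 1  # required input length for stride 1
        pieces.append(x[out_start:out_start + in_len])
        out_start += n
    return pieces


x = [float(v) for v in range(10)]
pieces = split_with_overlap(x, [4, 4], kernel=3)
merged = [y for p in pieces for y in moving_sum(p, 3)]

print(pieces[0][-2:], pieces[1][:2])  # the shared elements
print(merged == moving_sum(x, 3))
```

  • The shared elements are computed on twice, once per resource, which is the cost of avoiding any communication between the resources.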
  • When the axis type of the target splitting axis in the first operator is a sliding window axis, splitting each of the Q first input tensors along the target splitting axis according to the K second input lengths corresponding to the target splitting axis in each group of second input tensors, to obtain Q groups of second input tensors, includes: calling a second splitting function to split each of the Q first input tensors along the target splitting axis, to obtain Q groups of third input tensors, where each group of third input tensors includes K third input tensors; and, according to the K second input lengths corresponding to the target splitting axis in each group of second input tensors, calling a second slicing function on the K third input tensors.
  • The elements corresponding to the target splitting axis in the q-th group of third input tensors are a subset of the elements corresponding to the target splitting axis in the q-th first input tensor. The elements corresponding to the target splitting axis in the individual third input tensors of the q-th group do not intersect, and their union is the set of elements corresponding to the target splitting axis in the q-th first input tensor.
  • The elements corresponding to the target splitting axis in the k-th second input tensor of the q-th group of second input tensors are contiguous.
  • Splitting a sliding window axis without overlap is suitable for scenarios that require frequent data synchronization between computing resources, for example multi-chip parallelism, where the splicing function serves as the link between different dies. This approach avoids repeated computation of overlapping data and prevents the overlapping data from growing continuously, which effectively reduces the computing and storage pressure on the computing resources.
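  • One way to picture the non-overlapping variant is a disjoint split followed by a slice-and-splice step that borrows border elements from the neighbouring piece. The sketch below is an assumption about how such a step could look, not the procedure defined in the application; `split_then_splice` and `moving_sum` are hypothetical, and the code assumes stride 1, an input length divisible by k, and pieces at least `kernel - 1` long.

```python
def moving_sum(x, kernel):
    """Stand-in sliding-window operator: stride 1, no padding."""
    return [sum(x[i:i + kernel]) for i in range(len(x) - kernel + 1)]


def split_then_splice(x, k, kernel):
    """Split x into k disjoint pieces, then append to each piece the
    border elements sliced from its right-hand neighbour."""
    size = len(x) // k
    thirds = [x[i * size:(i + 1) * size] for i in range(k)]  # disjoint pieces
    halo = kernel - 1
    seconds = []
    for i, piece in enumerate(thirds):
        border = thirds[i + 1][:halo] if i + 1 < k else []   # slice
        seconds.append(piece + border)                       # splice
    return thirds, seconds


x = [float(v) for v in range(10)]
thirds, seconds = split_then_splice(x, k=2, kernel=3)
merged = [y for p in seconds for y in moving_sum(p, 3)]

print([len(t) for t in thirds], [len(s) for s in seconds])
print(merged == moving_sum(x, 3))
```

  • Each resource stores only its own disjoint piece; the border elements are fetched from the neighbour at the point where they are needed.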
  • When the axis type of the target splitting axis in the first operator is a reduction axis, splitting each of the Q first input tensors to obtain Q groups of second input tensors includes: calling a third splitting function to split each of the Q first input tensors according to the number K of target computing resources, to obtain the Q groups of second input tensors.
  • The elements corresponding to the target splitting axis in the q-th group of second input tensors are a subset of the elements corresponding to the target splitting axis in the q-th first input tensor. The elements corresponding to the target splitting axis in the individual second input tensors of the q-th group do not intersect, and their union is the set of elements corresponding to the target splitting axis in the q-th first input tensor.
  • Because the type of a reduction axis already determines the specific splitting method, the graph optimizer does not need to rely on the principles of a specific operator during graph optimization and can reasonably split the input tensor of an operator that includes a reduction axis.
  • A reduction axis is characterized by either not appearing on the output tensor or having length 1 on the output tensor. Traditional operator splitting methods therefore cannot split an axis of the input tensor that has this characteristic.
  • Reduction axes include a first type and a second type. A first-type reduction axis is one along which the operator performs a reduction operation on the elements of its input tensor. A second-type reduction axis is one along which the operator does not perform a reduction operation on the elements of its input tensor.
  • The first type of reduction axis includes any of the following: the reduction sum axis, the reduction maximum axis, the reduction minimum axis, and the reduction average axis. The reduction sum axis is a reduction axis along which the operator sums the elements of its input tensor; the reduction maximum axis is one along which the operator takes the maximum of the elements; the reduction minimum axis is one along which the operator takes the minimum of the elements; and the reduction average axis is one along which the operator averages the elements.
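  • Splitting a first-type reduction axis leaves each resource with a partial result that must be combined afterwards. The sketch below shows one plausible set of combine rules on a 1-D tensor; the `combine` helper and the length-weighted rule for the average are assumptions for illustration, not text from the application.

```python
def mean(values):
    return sum(values) / len(values)


REDUCERS = {"sum": sum, "max": max, "min": min, "mean": mean}


def combine(kind, partials, lengths):
    """Merge per-piece reduction results into the full result."""
    if kind == "mean":
        # Partial averages must be weighted by the length of their piece.
        return sum(p * n for p, n in zip(partials, lengths)) / sum(lengths)
    # Sum, max and min combine with the same reduction applied again.
    return REDUCERS[kind](partials)


x = [4.0, 1.0, 7.0, 3.0, 6.0]
parts = [x[:3], x[3:]]             # reduction axis split for K = 2
lengths = [len(p) for p in parts]

results = {
    kind: combine(kind, [reduce(p) for p in parts], lengths)
    for kind, reduce in REDUCERS.items()
}
print(results)
```

  • The average is the one case where the pieces' lengths matter: with unequal pieces, averaging the partial averages directly would give the wrong result.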
  • The second type of reduction axis includes the reduction gather (acquisition) axis, which is the axis along which the operator indexes data on its input tensor according to the addresses indicated by the elements of its index input tensor.
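  • As a hedged illustration of why a gather-style axis can still be split, the sketch below cuts the indexed dimension of a data tensor into two pieces and lets each piece answer only the indices that fall inside its own range. The offset bookkeeping is an assumption made for this example and is not taken from the application.

```python
table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
         [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]   # data input tensor
indices = [5, 0, 3, 2]                           # index input tensor

parts = [table[:3], table[3:]]  # gather axis split for K = 2

out = [None] * len(indices)
offset = 0
for part in parts:
    for pos, idx in enumerate(indices):
        local = idx - offset           # address relative to this piece
        if 0 <= local < len(part):     # this piece owns the address
            out[pos] = part[local]
    offset += len(part)

print(out == [table[i] for i in indices])
```

  • No arithmetic reduction is performed on the data, yet the indexed axis does not appear in the output, which matches the description of a second-type reduction axis.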
  • The computing resource is one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  • The axis type of a splittable axis is one of the following: element axis, reduction axis, or sliding window axis. An axis along which the elements of the operator's input tensor and output tensor have a point-to-point mapping is an element axis. If a first axis is present in the operator's input tensor but absent from its output tensor, the first axis is a reduction axis. An axis along which the operator performs a sliding-window scan over the elements of its input tensor is a sliding window axis.
  • The processor is specifically configured to: determine the target splitting axis, which is one of the N splittable axes; determine, according to the splitting information of the first operator, the splitting method corresponding to the axis type of the target splitting axis in the first operator; and split the input tensor of the first operator according to that splitting method, to obtain K groups of input tensors.
  • The processor is specifically configured to: determine, according to the splitting method corresponding to the axis type of the target splitting axis in the first operator, the Q first input tensors of the first operator that include the target splitting axis.
  • Each group of second input tensors among the Q groups includes K second input tensors, and the q-th group is the result of splitting the q-th first input tensor of the Q first input tensors into K pieces. The k-th group of input tensors among the K groups includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
  • The processor is specifically configured to: obtain the splitting information of the second operator from the operator splitting information library, where the splitting information of the second operator includes the axis type and second position information, in the second operator, of the p-th splittable axis among the P splittable axes, and the second position information indicates the position of the p-th splittable axis in the input tensor of the second operator.
  • The processor is specifically configured to: determine, according to the target splitting method, the target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, the Q first input tensors in the first operator that include the target splitting axis, and the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; and split each of the Q first input tensors according to the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the number K of target computing resources, to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors.
  • The k-th group of input tensors among the K groups includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
  • The processor is specifically configured to perform the following steps.
  • Determine, according to the first position information and the second position information of the target splitting axis, the L first output tensors that include the target splitting axis in the first operator and the position of the target splitting axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1.
  • Use the first input length as the input of the forward shape derivation function corresponding to the axis type of the target splitting axis in the first operator, to obtain the third input length. The first input length is the length of the target splitting axis in each first input tensor, and this length is the same in every first input tensor.
  • Use the third input length as the input of the forward shape derivation function corresponding to the axis type of the target splitting axis in the second operator, to obtain the first output length.
  • Split the L first output tensors along the target splitting axis according to the first output length and the number K of target computing resources, to obtain L groups of second output tensors. Each group includes K second output tensors, and the l-th group is the result of splitting the l-th first output tensor into K pieces.
  • Use each of the K second output lengths corresponding to the target splitting axis as the input of the reverse derivation function corresponding to the axis type of the target splitting axis in the second operator, to obtain the K third input lengths corresponding to the target splitting axis in each group of fifth input tensors among the Q groups of fifth input tensors. The target splitting axis has the same length in the k-th second output tensor of every group among the L groups of second output tensors, and the same length in the k-th tensor of every group among the Q groups of fifth input tensors.
  • Use each of the K third input lengths as the input of the reverse derivation function corresponding to the axis type of the target splitting axis in the first operator, to obtain the K second input lengths corresponding to the target splitting axis in each group of second input tensors among the Q groups of second input tensors. The target splitting axis has the same length in the k-th second input tensor of every group among the Q groups of second input tensors.
  • Split each of the Q first input tensors along the target splitting axis according to the K second input lengths, to obtain the Q groups of second input tensors.
  • continuous operator operations are performed on the divided input tensor on the same target computing resource, so that parallel computing of multiple target computing resources can be realized.
  • the processor is specifically configured to: determine, according to the first position information of the target segmentation axis, the L first output tensors in the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1.
  • the first input length is used as the input of the forward shape derivation function of the target splitting axis to obtain the first output length, where the first input length is the length of the target splitting axis in each first input tensor, and the lengths of the target splitting axis in all the first input tensors are equal.
  • according to the first output length and the number K of target computing resources, the L first output tensors are split along the target splitting axis to obtain L groups of second output tensors, and the K second output lengths corresponding to the target splitting axis are respectively used as the input of the reverse derivation function of the target splitting axis to obtain the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors.
  • the lengths of the target splitting axis in the k-th second output tensor of each of the L groups of second output tensors are equal, and the lengths of the target splitting axis in the k-th second input tensor of each of the Q groups of second input tensors are equal.
  • according to the K second input lengths corresponding to the target splitting axis in each of the Q groups of second input tensors, each of the Q first input tensors is split along the target splitting axis to obtain the Q groups of second input tensors.
  • the processor is specifically configured to: according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, segment each of the Q first input tensors along the target segmentation axis by scheduling the first slice function, to obtain the Q groups of second input tensors.
  • the elements corresponding to the target split axis in the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor of the Q first input tensors; there is no intersection between the elements corresponding to the target split axis in the individual second input tensors of the q-th group; and the union of the elements corresponding to the target split axis in the second input tensors of the q-th group is the set of elements corresponding to the target split axis in the q-th first input tensor.
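  • The non-overlapping case above can be sketched in a few lines of Python. The helper names `split_lengths` and `split_without_overlap`, and the policy of giving the remainder to the first pieces, are assumptions made for illustration; the text only requires that the K pieces be disjoint and that their union equal the original axis.

```python
def split_lengths(axis_len, k):
    """Return k near-equal lengths that sum to axis_len (remainder policy is an assumption)."""
    base, rem = divmod(axis_len, k)
    return [base + (1 if i < rem else 0) for i in range(k)]


def split_without_overlap(elements, lengths):
    """Cut the elements along the target split axis into consecutive, disjoint pieces."""
    pieces, start = [], 0
    for n in lengths:
        pieces.append(elements[start:start + n])
        start += n
    return pieces


axis_elements = list(range(10))                  # elements along the target split axis
lengths = split_lengths(len(axis_elements), 3)   # K = 3 target computing resources
pieces = split_without_overlap(axis_elements, lengths)

# Every piece is a subset of the axis, the pieces are pairwise disjoint,
# and concatenating them restores the original axis.
print(lengths)
print(pieces)
```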
  • the processor is specifically configured to: according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, segment each of the Q first input tensors with overlap along the target segmentation axis by scheduling the first slice function, to obtain the Q groups of second input tensors.
  • the elements corresponding to the target split axis in the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor of the Q first input tensors; there is an intersection between the elements corresponding to the target split axis in the individual second input tensors of the q-th group; and the union of the elements corresponding to the target split axis in the second input tensors of the q-th group is the set of elements corresponding to the target split axis in the q-th first input tensor.
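  • As a hedged illustration of how forward and reverse shape derivation could produce overlapping input pieces, the sketch below assumes a one-dimensional sliding window with no padding and no dilation. The function names `forward_len` and `reverse_range` and the formulas are assumptions for this example and are not taken from the application.

```python
def forward_len(in_len, kernel, stride):
    """Forward shape derivation for a sliding window without padding (assumed formula)."""
    return (in_len - kernel) // stride + 1


def reverse_range(out_start, out_end, kernel, stride):
    """Reverse derivation: input range [start, end) needed to produce outputs [out_start, out_end)."""
    return out_start * stride, (out_end - 1) * stride + kernel


in_len, kernel, stride, k = 10, 3, 1, 2
out_len = forward_len(in_len, kernel, stride)

# Split the output axis evenly into k pieces, then derive the input range of each piece.
out_bounds = [(i * out_len // k, (i + 1) * out_len // k) for i in range(k)]
in_ranges = [reverse_range(a, b, kernel, stride) for a, b in out_bounds]

print(out_len)     # 8
print(in_ranges)   # [(0, 6), (4, 10)]: input elements 4 and 5 belong to both pieces
```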
  • the processor is specifically configured to: by scheduling the second segmentation function, split each of the Q first input tensors along the target splitting axis to obtain Q groups of third input tensors, where each of the Q groups of third input tensors includes K third input tensors; according to the K second input lengths corresponding to the target slicing axis in each of the Q groups of second input tensors, by scheduling the second slice function, split the K third input tensors in each of the Q groups of third input tensors along the target splitting axis to obtain Q groups of fourth input tensors; by scheduling the splicing function, the k-th fourth input tensor of the q-th group of fourth input tensors in the Q groups of fourth input tensors and the third input tensor of the Q groups of third input tensors.
  • the elements corresponding to the target splitting axis in the q-th group of third input tensors are a subset of the elements corresponding to the target splitting axis in the q-th first input tensor of the Q first input tensors; there is no intersection between the elements corresponding to the target splitting axis in the individual third input tensors of the q-th group; and the union of the elements corresponding to the target splitting axis in the third input tensors of the q-th group is the set of elements corresponding to the target splitting axis in the q-th first input tensor.
  • the elements corresponding to the target split axis of the kth second input tensor in the qth group of second input tensors in the Q group of second input tensors are continuous.
  • the sliding window axis without overlapping is suitable for scenarios where frequent data synchronization between different computing resources is required, for example, multi-chip parallelism, where the splicing function serves as the link between different dies. In this way, overlapping data is not computed repeatedly and does not keep growing, which effectively reduces the computing pressure and storage pressure on computing resources.
  • the processor is specifically configured to: call the third segmentation function according to the number K of target computing resources, and segment each of the Q first input tensors to obtain Q groups of second input tensors.
  • the elements corresponding to the target split axis in the q-th group of second input tensors are a subset of the elements corresponding to the target split axis in the q-th first input tensor of the Q first input tensors; there is no intersection between the elements corresponding to the target split axis in the individual second input tensors of the q-th group; and the union of the elements corresponding to the target split axis in the second input tensors of the q-th group is the set of elements corresponding to the target split axis in the q-th first input tensor.
  • since the type of the reduction axis already determines the specific segmentation method, the graph optimizer, when performing graph optimization, does not need to rely on the principle of a specific operator and can reasonably segment the input tensors of an operator that includes the reduction axis.
  • the characteristic of the reduction axis is that it either does not appear on the output tensor or has a length of 1 on the output tensor. Therefore, the traditional operator splitting method cannot split an axis of the input tensor that has the characteristics of a reduction axis.
  • the reduction axis includes the first type of reduction axis and the second type of reduction axis, wherein the first type of reduction axis is the reduction axis for the operator to reduce the elements in the input tensor of the operator,
  • the second type of reduction axis is the reduction axis for which the operator does not perform a reduction operation on the elements in the operator's input tensor.
  • the first type of reduction axis includes any of the following: the reduction sum axis, the reduction maximum value axis, the reduction minimum value axis, and the reduction average value axis; wherein, the reduction sum axis is an operator The reduction axis for summing and reducing the elements in the operator's input tensor; the reduction maximum axis is the reduction axis for the operator to perform the maximum reduction operation on the elements in the operator's input tensor; the reduction minimum value axis is the operator pair The reduction axis on which the elements in the input tensor of the operator perform the minimum value reduction operation; the reduction average axis is the reduction axis on which the operator performs the average reduction operation on the elements in the input tensor of the operator.
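  • One plausible way to split along the first type of reduction axis is to let each target computing resource reduce its own slice and then combine the partial results with a second, small reduction. The sketch below is an illustration under that assumption; the data and variable names are hypothetical, and the combination step actually used for each axis type is the one described with the corresponding figures.

```python
data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0]   # elements along the reduction axis
parts = [data[:4], data[4:]]                  # K = 2 disjoint slices

# Each target computing resource reduces its own slice.
partial_sums = [sum(p) for p in parts]
partial_maxs = [max(p) for p in parts]
partial_mins = [min(p) for p in parts]

# A second, small reduction combines the partial results.
total_sum = sum(partial_sums)
total_max = max(partial_maxs)
total_min = min(partial_mins)

# The mean has to be weighted by slice length, so it is rebuilt from sums and counts
# rather than by averaging the per-slice means.
total_mean = sum(partial_sums) / sum(len(p) for p in parts)

print(total_sum, total_max, total_min, total_mean)
```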
  • the second type of reduction axis includes the reduction acquisition axis, which is the reduction axis along which the operator indexes data on the elements of the operator's input tensor according to the addresses indicated by the elements of the operator's index input tensor.
  • the target computing resource includes one of the following types: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  • the device may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor performs the method in any implementation of the first aspect.
  • a computer-readable medium stores program code, where the program code is used to execute the method in any implementation manner in the first aspect.
  • FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of operator segmentation provided in the embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for processing computing tasks provided by an embodiment of the present application
  • Fig. 4 is a schematic flow diagram of an operator segmentation method corresponding to a single operator completing a computing task provided by an embodiment of the present application;
  • Fig. 5 is a schematic flowchart of another method for processing computing tasks provided by the embodiment of the present application.
  • Fig. 6 is a schematic flowchart of an operator segmentation method corresponding to multi-operator completion of computing tasks provided by the embodiment of the present application;
  • Fig. 7 is a schematic diagram of a splitting method of an element axis provided in the embodiment of the present application.
  • Fig. 8 is a schematic diagram of a splitting method of a reduction sum axis provided by an embodiment of the present application.
  • Fig. 9 is a schematic diagram of a splitting method of the reduction maximum value axis provided by the embodiment of the present application.
  • Fig. 10 is a schematic diagram of a splitting method of a reduction average value axis provided by the embodiment of the present application.
  • Fig. 11 is a schematic diagram of a splitting method of a reduction acquisition axis provided by the embodiment of the present application.
  • Fig. 12 is a schematic diagram of a sliding window axis splitting method provided by the embodiment of the present application.
  • Fig. 13 is a schematic diagram of another sliding window axis splitting method provided by the embodiment of the present application.
  • Fig. 14 is a schematic diagram of the position information of an operator's splittable axis in the input tensor and output tensor of the operator provided by the embodiment of the present application;
  • Fig. 15 is a schematic diagram of a specific application of operator segmentation provided by the embodiment of this application.
  • Fig. 16 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of this application.
  • Fig. 17 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of the present application.
  • Fig. 18 is a schematic diagram of an operator tensor structure provided by an embodiment of the present application.
  • Fig. 19 is a schematic diagram of an apparatus for processing computing tasks provided by an embodiment of the present application.
  • references to "one embodiment” or “some embodiments” or the like in this specification means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • a deep learning model refers to a machine learning model that includes a deep neural network structure. Algorithm engineers use the deep learning framework to build a model, adjust parameters, train and optimize the model, and save the final generated network parameters and model structure together. The resulting file is a model file that can be used for forward reasoning.
  • model files trained by different deep learning frameworks are different, but a complete model file generally contains information such as tensor data, computing units, and calculation graphs.
  • A tensor is the data container of a deep learning system and can be understood as the extension of a matrix to any number of dimensions.
  • a tensor containing only one number is called a scalar (Scalar), a scalar tensor, a zero-dimensional tensor, or a 0D tensor; an array of numbers is called a vector (Vector) or a one-dimensional tensor or a 1D tensor; an array of vectors is called Matrix (Matrix) or two-dimensional tensor or 2D tensor; combining multiple matrices into a new data can get a three-dimensional tensor, which can be intuitively understood as a cube composed of numbers; combining multiple three-dimensional tensors into an array, a four-dimensional tensor can be created, and so on.
  • Deep learning generally deals with tensors from 0D to 4D, but 5D tensors may be encountered when processing video data.
  • the shape of the tensor indicates the number of elements in each dimension of the tensor. For example, [[[1,2,3]], [[7,8,9]]] is a three-dimensional tensor whose shape is (2,1,3).
  • This 3D tensor has size 2 for the 0-axis, 1 for the 1-axis, and 3 for the 2-axis.
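  • The shape of the example tensor can be checked with a short helper. This is a minimal sketch that assumes the tensor is stored as regular, non-empty nested Python lists; `shape_of` is a hypothetical name.

```python
def shape_of(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))   # number of elements along the current axis
        nested = nested[0]          # descend to the next axis
    return tuple(shape)


tensor = [[[1, 2, 3]], [[7, 8, 9]]]
shape = shape_of(tensor)
print(shape)   # (2, 1, 3)
```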
  • FIG. 18 is a schematic diagram of a tensor provided by the embodiment of the present application.
  • the shape of the tensor shown in FIG. 18 is (4, 20, 20, 3). Assuming that the tensor shown in FIG. 18 represents a feature map, the physical meaning of the tensor shape in FIG. 18, from left to right, is the batch size, height, width, and number of channels of the feature map.
  • the axis of the tensor is defined relative to the shape of the tensor and indicates the subscript into the shape. For example, [[[1,2],[3,4]],[[5,6],[7,8]]] is a three-dimensional tensor with a shape of (2,2,2); the 0-axis represents the data of the first dimension, namely the two matrices [[1,2],[3,4]] and [[5,6],[7,8]]; the 1-axis represents the data of the second dimension: [1,2], [3,4], [5,6] and [7,8]; and the 2-axis represents the data of the third dimension: 1, 2, 3, 4, 5, 6, 7, and 8.
  • for the tensor of shape (4,20,20,3) shown in Figure 18, the 0-axis is the batch size data of the feature map, the 1-axis is the feature map height data, the 2-axis is the feature map width data, and the 3-axis is the feature map channel data.
  • An operator which can also be called an operation unit, a calculation unit or an operator, represents a symbolic operation process and is the basic unit of the mainstream deep learning framework, that is, the node in the graph.
  • the input and output of the operation unit are tensors. All transformations learned by deep networks can be reduced to a few tensor operations on tensors of numerical data.
  • Common operation units include the add unit, batch normalization (BatchNormalization) unit, convolution unit, gated recurrent unit (GRU), local response normalization (LRN) unit, long short-term memory (LSTM) unit, maximum pooling (max pool) unit, the sparse activation function rectified linear unit (ReLU), recurrent neural network (RNN) unit, and Softmax function.
  • A computation graph, also known as a data flow graph, is defined as a directed acyclic graph (DAG).
  • Both tensor and operation unit are objects in the graph, the operation unit is the node of the graph, and the tensor is the data flowing on the edge of the graph.
  • Acyclic means that the graph cannot have cycles, for example, a tensor x cannot be an input to a layer that generates x.
  • the only allowed processing loops i.e. recurrent connections
  • each node represents a neuron, and if the output of one node is used as the input of another node, the two nodes share an edge. That is, the nodes in the computation graph represent operators, and the edges between nodes represent data dependencies between two nodes.
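  • The graph properties described above can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). The node names below are hypothetical; the sketch only shows that a valid computation graph admits a topological execution order, while a graph with a cycle does not.

```python
from graphlib import CycleError, TopologicalSorter

# Map each node to the set of nodes whose outputs it consumes (its data dependencies).
graph = {
    "x": set(),            # input tensor
    "w": set(),            # weight tensor
    "matmul": {"x", "w"},  # operator node
    "relu": {"matmul"},    # operator node
}
order = list(TopologicalSorter(graph).static_order())
print(order)   # every node appears after all nodes it depends on

# A node that (indirectly) consumes its own output creates a cycle, which a DAG forbids.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```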
  • Operator splitting is the splitting of the input tensor and output tensor of the operator.
  • FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of the present application.
  • the deep learning compiler will be briefly introduced below in conjunction with FIG. 1 .
  • the deep learning compiler can be divided into the front end of the compiler, the middle end of the compiler, and the back end of the compiler.
  • the front end of the compiler is connected to the application layer, that is, the front end of the compiler is connected to the deep learning model.
  • the parser mainly converts models trained under different frameworks into an internal format recognizable by the hardware, for example, it converts the calculation graph of a framework such as TensorFlow or Caffe2 into a calculation graph in the internal recognizable format.
  • the middle end of the compiler includes a graph optimizer and operator information, etc.
  • the graph optimizer can also be called a graph optimization module.
  • the middle end of the compiler assigns different computing tasks to different computing resources (for example, CPU, GPU) for subsequent model execution.
  • the back end of the compiler mainly generates, automatically, code instructions that match different hardware.
  • the back end of the compiler includes a sub-compiler, an operator library, and the like.
  • Deep learning compilers usually improve the performance of the model on different devices at two levels of graph optimization and operator optimization.
  • Graph optimization and operator optimization are relatively decoupled and independent.
  • Graph optimization is a general optimization strategy, that is, an optimization strategy that has nothing to do with specific operator types, while operator optimization is an operator-specific optimization strategy, that is, an optimization strategy that is related to specific operator types.
  • Typical types of operator optimization strategies include compute optimization and schedule optimization.
  • a single specific operator can be optimized to the extreme on a specific hardware platform.
  • for example, for the general matrix-matrix multiplication (GEMM) operator, manual scheduling techniques such as blocking, vectorization, loop permutation, data rearrangement (packing), and multi-core multi-thread parallelism (parallel) are usually used to optimize the scheduling of the GEMM operator, so that the GEMM operator can obtain a performance gain of dozens of times on the CPU.
  • Typical graph optimization strategies include constant folding. Constant folding means that if all the input tensors that an operator depends on are constants, the operator node does not depend on the running of the model and can be calculated in advance at compile time, thereby saving overhead at model runtime.
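  • Constant folding can be sketched on a toy graph as follows. The graph encoding (a dictionary of tuples listed in topological order) and the function name `fold_constants` are assumptions made for this example and do not correspond to any particular compiler.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace every operator node whose inputs are all constants by a constant node.

    `graph` maps a node name to ("const", value), ("input",) or (op, lhs, rhs);
    entries are assumed to be listed in topological order.
    """
    folded = {}
    for name, node in graph.items():
        kind = node[0]
        if kind in OPS:
            args = [folded[arg] for arg in node[1:]]
            if all(arg[0] == "const" for arg in args):
                # All inputs are known at compile time, so compute the result now.
                folded[name] = ("const", OPS[kind](*(arg[1] for arg in args)))
                continue
        folded[name] = node
    return folded


graph = {
    "a": ("const", 2),
    "b": ("const", 3),
    "x": ("input",),
    "c": ("mul", "a", "b"),   # depends only on constants: folded to 6
    "y": ("add", "x", "c"),   # depends on a runtime input: kept as an operator
}
folded = fold_constants(graph)
print(folded["c"], folded["y"])
```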
  • graph optimization strategies such as graph segmentation and execution order optimization, multi-die parallelism, multi-thread parallelism, and chip parallelism. These graph optimization strategies need to be based on the principle of the operator itself. If it is not based on the principle of the operator itself, parallel optimization strategies cannot be expressed in the calculation graph.
  • graph segmentation and execution order optimization is a graph optimization strategy that reduces the memory limitation on operator execution. Specifically, the outer loop iteration variables of the operator are split uniformly, and the subsequent execution order of the operator is adjusted on this basis, so that the operator can perform a large number of iterations locally. This reduces the memory requirement of the operator and keeps more of the intermediate results generated by local operations in the L2 cache, thereby reducing the memory limitation on subsequent operations of the operator and ultimately improving the performance of the overall network model. Therefore, the segmentation method of the operator is particularly important.
  • the die is a chip before packaging.
  • the chip uses advanced packaging technology to accumulate computing power.
  • an operator can be split across multiple dies, and data interaction between different dies should be reduced. Therefore, the segmentation method of the operator is particularly important.
  • a subgraph is used as a basic operation unit, that is, a subgraph including different operators is used as a basic operation unit.
  • when a subgraph is assigned to multiple threads for parallel execution, the operators need to be segmented; that is, when a subgraph is divided across different computing resources for parallel execution, for example, on different CPUs, the operators in the subgraph need to be segmented. Since data synchronization between multiple threads means that the running of a subgraph ends, the interaction between different threads should also be minimized when multiple threads run in parallel. Therefore, the segmentation of operators in subgraphs is also particularly important.
  • Figure 2 is a schematic diagram of operator segmentation provided by the embodiment of the present application.
  • the tensor of the same operator can be divided into different slices, and different slices can run on different threads, different dies, or different chips.
  • the input tensor of operator 1 is divided into slice 1 of operator 1 and slice 2 of operator 1, and the input tensor of operator 2 is divided into slice 1 of operator 2 and slice 2 of operator 2.
  • slice 1 of operator 1 runs on computing resource 1 and passes its running result to computing resource 2, on which slice 1 of operator 2 runs; at the same time, slice 2 of operator 1 runs on computing resource 3 and passes its running result to computing resource 4, on which slice 2 of operator 2 runs; finally, the running results on computing resource 2 and computing resource 4 are spliced.
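  • The flow of FIG. 2 can be imitated with two toy element-wise operators. In this sketch, threads from Python's `concurrent.futures` stand in for the computing resources, and `operator_1`, `operator_2`, and `run_slice` are hypothetical names; it is an illustration of the slicing idea, not the scheduling used by the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def operator_1(xs):
    """Toy element-wise operator: add 1 to every element."""
    return [x + 1 for x in xs]


def operator_2(xs):
    """Toy element-wise operator: double every element."""
    return [x * 2 for x in xs]


def run_slice(xs):
    """Run the slice of operator 1, then feed its result to the matching slice of operator 2."""
    return operator_2(operator_1(xs))


x = [1, 2, 3, 4]
slices = [x[:2], x[2:]]   # slice 1 and slice 2 of the input tensor

# Each slice is processed independently; pool.map keeps the results in slice order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_slice, slices))

# Splice the per-slice results back together.
spliced = [value for part in results for value in part]
print(spliced)   # [4, 6, 8, 10]
```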
  • the current processing method is that the algorithm engineer manually classifies the loop variables in the necessary operators and summarizes the changes that occur after the axis of each type of loop variable is split, which assists the efficient generation of graph optimization strategies. However, the efficient generation of graph optimization strategies currently depends on this manual classification of loop variables, so arbitrary operators cannot be automatically segmented and run.
  • in one existing segmentation method, the input tensor of the operator is split at the application layer based on the splittable axes of the operator output (for example, the sample axis, parameter axis, and attribute axis), so as to achieve the effect of multi-GPU parallel operation through operator splitting.
  • the sample axis, parameter axis, and attribute axis are three types of axes that can be split on the output of the operator.
  • splitting according to the sample axis means splitting the operator input tensor in the sample dimension; the operator input tensors split along the sample axis are allocated to different computing resources for data parallelism.
  • splitting according to the parameter axis means splitting the operator input tensor in the parameter dimension; the operator input tensors split along the parameter axis are distributed to different computing resources for model parallelism.
  • the attribute axis is any axis of the operator output other than the sample axis and the parameter axis; splitting according to the attribute axis means splitting the samples of the operator input tensor in the attribute dimension.
  • the operator input tensor can be split according to any one of the three axes alone, or according to a combination of the three axes, and the resulting pieces are assigned to different computing resources for calculation, achieving the effect of parallel operation on multiple computing resources.
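  • For a matrix multiplication, the difference between splitting along the sample axis and splitting along the parameter axis can be sketched as follows. Treating the rows of `x` as the sample axis and the output columns of `w` as the parameter axis is an assumption made for this example, as are all names and values.

```python
def matmul(x, w):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # (samples=4, features=2)
w = [[1, 0, 2], [0, 1, 3]]             # (features=2, outputs=3)
full = matmul(x, w)

# Sample axis (data parallelism): split the rows of x, share w, concatenate rows.
by_sample = matmul(x[:2], w) + matmul(x[2:], w)

# Parameter axis (model parallelism): split the columns of w, share x, concatenate columns.
w_left = [row[:2] for row in w]
w_right = [row[2:] for row in w]
by_param = [left + right for left, right in zip(matmul(x, w_left), matmul(x, w_right))]

print(full == by_sample == by_param)   # True
```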
  • the current segmentation method can achieve a certain degree of automatic operator segmentation at the application layer, it still has certain limitations.
  • first, the three kinds of axes are defined according to the axes in the output tensor, which cannot cover all the splittable axes and splitting methods of the operator.
  • second, these three kinds of axes are determined by the types of the axes of the operator output tensor; that is, an axis that does not appear in the operator output tensor is not split, regardless of the actual situation of the operator input tensor. This results in a coarse segmentation of the operator input tensor, so the operator cannot be accurately segmented and then allocated to different computing resources for calculation. Finally, this method still defines the segmentation axes and segmentation methods at the application layer; in other words, the algorithm engineer uses a script language at the application layer to determine the segmentation method based on the segmentation axes included in a certain operator type. In this way, automatic segmentation of the inputs and outputs of different operators still cannot be realized, and graph optimization and operator optimization cannot be completely decoupled.
  • the embodiment of the present application proposes a method and an apparatus for processing computing tasks, which will be described in detail below with reference to FIGS. 3 to 19 .
  • Fig. 3 is a schematic flowchart of a method for processing computing tasks provided by an embodiment of the present application.
  • S301 Determine a first operator for performing a computing task, where the first operator includes N divisible axes, where N is a positive integer greater than or equal to 1.
  • N splittable axes included in the first operator means that the input tensor of the first operator includes N splittable axes.
  • the segmentation information of the first operator can indicate the axis type that each of the N splittable axes has in the first operator. The different axis types and their corresponding segmentation methods are described in detail later in conjunction with FIG. 7 to FIG. 17.
  • the information included in the splitting information of the first operator may also indicate on which input tensors of the first operator each splittable axis will appear and on which axis it will appear in these input tensors.
  • for example, splittable axis 1 appears in input tensors 1 and 2 of the first operator: it appears on axis 0 of input tensor 1 and on axis 0 of input tensor 2.
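  • The kind of information described above (axis type plus the position of each splittable axis in the input tensors) could be held in a small record such as the one below. The class names, field names, and the axis type label are hypothetical; the application does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field


@dataclass
class SplittableAxis:
    name: str
    axis_type: str                                        # hypothetical label, e.g. "elementwise"
    input_positions: dict = field(default_factory=dict)   # input tensor number -> axis index


@dataclass
class OperatorSplitInfo:
    op_name: str
    axes: list

    def inputs_containing(self, axis_name):
        """Return the input tensor numbers on which the named splittable axis appears."""
        for axis in self.axes:
            if axis.name == axis_name:
                return sorted(axis.input_positions)
        return []


# Splittable axis 1 appears on axis 0 of input tensor 1 and on axis 0 of input tensor 2.
info = OperatorSplitInfo(
    op_name="first_operator",
    axes=[SplittableAxis("splittable_axis_1", "elementwise", {1: 0, 2: 0})],
)
print(info.inputs_containing("splittable_axis_1"))   # [1, 2]
```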
  • the number of input tensors included in each of the K sets of input tensors is the same as the number of input tensors included in the first operator.
  • the input tensor of the first operator is segmented to obtain K groups of input tensors, where M is a positive integer greater than or equal to 2.
  • the graph optimizer does not necessarily need to use all computing resources.
  • the required number K of target computing resources can be estimated, or it can be determined randomly, which is not limited in this embodiment of the present application.
  • each of the K groups of input tensors is the input required by one target computing resource. For example, if the single computing resource used to complete the computing task requires a input tensors before the input tensor of the first operator is split, then after the split each target computing resource used to complete the computing task also requires a input tensors.
  • the target splitting axis is determined according to the splitting information of the first operator, and the input tensor of the first operator is split according to the target splitting axis to obtain K groups of input tensors.
  • the flow of the corresponding operator segmentation method that uses a single operator to complete a computing task will be described in detail later with reference to FIG. 4 .
  • splitting the input tensor of the first operator is not splitting all the input tensors of the first operator, but splitting the input tensor including the target splitting axis. Input tensors that do not include the target split axis are sent to each target compute resource as shared input data.
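  • A minimal sketch of building the K groups of input tensors is shown below. It assumes, for brevity, that the target splitting axis is axis 0 of every input tensor that contains it; `split_axis0`, `make_groups`, and the tensor names are hypothetical.

```python
def split_axis0(tensor, k):
    """Cut a nested-list tensor into k consecutive pieces along axis 0."""
    n = len(tensor)
    bounds = [(i * n // k, (i + 1) * n // k) for i in range(k)]
    return [tensor[a:b] for a, b in bounds]


def make_groups(inputs, has_target_axis, k):
    """Build k groups of inputs: split tensors that contain the target axis, share the rest."""
    groups = [{} for _ in range(k)]
    for name, tensor in inputs.items():
        if has_target_axis[name]:
            for group, piece in zip(groups, split_axis0(tensor, k)):
                group[name] = piece
        else:
            for group in groups:
                group[name] = tensor   # sent unchanged to every target computing resource
    return groups


inputs = {"x": [[1, 2], [3, 4], [5, 6], [7, 8]], "bias": [10, 20]}
groups = make_groups(inputs, {"x": True, "bias": False}, k=2)
print(groups)
```

  • Each group keeps the same number of input tensors as the unsplit operator, which matches the requirement that every target computing resource receives a complete set of inputs.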
  • the space of candidate segmentation methods is determined according to the segmentation information of the first operator, the segmentation information of the second operator, and the number M of computing resources; the target segmentation method is determined from the space of candidate segmentation methods; and the input tensor of the first operator is segmented according to the target segmentation method to obtain K groups of input tensors.
  • the flow of the operator segmentation method corresponding to completing a computing task with multiple operators will be described in detail later in conjunction with FIG. 5.
  • the quantity K of target computing resources is determined according to the quantity M of computing resources.
  • the graph optimizer obtains the operator segmentation information from the operator segmentation information database. Since the segmentation information of each operator can be obtained directly from this database, the graph optimizer does not need to perceive the mathematical semantics or the underlying implementation of any operator and can automatically segment the input tensor of the operator, thereby completely decoupling graph optimization from operator optimization.
  • FIG. 4 is a schematic flowchart of an operator splitting method, provided in an embodiment of the present application, for the case where a single operator completes the computing task.
  • FIG. 4 is a specific description of a possible implementation of S303.
  • the graph optimizer randomly selects a splittable axis as the target splitting axis; for example, the first axis of the input tensor of the first operator is used as the target splitting axis, and the first axis can be a batch axis.
  • the graph optimizer selects, as the target splitting axis, the splittable axis that appears in the largest number of input tensors of the first operator. For example, if the first operator has 3 input tensors, splittable axis 1 appears in 3 of them, and splittable axis 2 appears in 2 of them, then splittable axis 1 can be used as the target splitting axis.
  • the splittable axis with the shortest computing time is selected as the target splitting axis.
  • the target splitting axis is determined according to the computing time required to complete the computing task with the splitting method corresponding to each splittable axis and the number K of target computing resources. For example, if completing the computing task with the splitting method corresponding to splittable axis 1 and b target computing resources takes the same computing time as completing it with the splitting method corresponding to splittable axis 2 and c target computing resources, but b is less than c, then splittable axis 1 is selected as the target splitting axis, and the number of target computing resources corresponding to splittable axis 1 is b.
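  • The selection rule above (shortest computing time first, then fewest target computing resources when times are equal) can be sketched as a two-part sort key. The candidate tuples and the timing values are invented purely for illustration; how the times are measured is outside this sketch.

```python
def pick_target_axis(candidates):
    """Pick the best (axis, num_resources, compute_time) candidate.

    Candidates are ranked by compute time; when times are equal,
    the one that occupies fewer target computing resources wins.
    """
    return min(candidates, key=lambda c: (c[2], c[1]))


candidates = [
    ("axis1", 2, 10.0),  # splittable axis 1 with b = 2 resources
    ("axis2", 4, 10.0),  # splittable axis 2 with c = 4 resources, same time
    ("axis2", 2, 12.5),  # a slower alternative
]
axis, k, time_cost = pick_target_axis(candidates)
```

  • With these hypothetical numbers, axis 1 is chosen with K = 2, because it matches the best time while occupying fewer resources.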
  • According to the splitting method corresponding to the axis type of the target splitting axis in the first operator, the input tensor of the first operator is split to obtain K groups of input tensors.
  • According to the splitting method corresponding to the axis type of the target splitting axis in the first operator, the Q first input tensors of the first operator that include the target splitting axis are determined, together with the position of the target splitting axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.
  • the Q first input tensors are input tensors of the first operator including the target split axis.
  • the position of the target splitting axis in each of the Q first input tensors indicates which axis of that first input tensor is the target splitting axis; for example, the target splitting axis is axis 0 of the 1st first input tensor and axis 0 of the 2nd first input tensor.
  • each of the Q first input tensors is split separately to obtain Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors.
  • each first input tensor that includes the target splitting axis is split along the target splitting axis into K second input tensors, and the K second input tensors are respectively used as input tensors of the K target computing resources. Therefore, when there are Q first input tensors that include the target splitting axis, Q groups of second input tensors are formed.
  • the axis type of the target splitting axis can be an elementwise axis, a sliding window axis, or a reduce axis; the splitting methods for these axis types are explained in detail below in conjunction with Figures 7 to 13.
  • K sets of input tensors are obtained according to Q sets of second input tensors and unsegmented input tensors of the first operator.
  • the k-th group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors and the unsplit input tensors of the first operator.
  • each of the K groups of input tensors includes the unsplit input tensors of the first operator, as shared data, and the second input tensors of the first operator, obtained by splitting, that correspond to one target computing resource.
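  • The assembly of the K groups of input tensors can be sketched as follows. This is only an illustration under simplifying assumptions: tensors are nested Python lists, the target splitting axis is assumed to be axis 0 of every tensor that contains it, and the helper names `split_axis0` and `build_input_groups` are hypothetical.

```python
def split_axis0(tensor, k):
    """Split a nested list along axis 0 into k contiguous, near-equal chunks."""
    base, extra = divmod(len(tensor), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        chunks.append(tensor[start:end])
        start = end
    return chunks


def build_input_groups(inputs, has_target_axis, k):
    """Return K groups; group i is the input list for target resource i.

    Tensors containing the target axis are split into K pieces (one per
    group); all other tensors are shared unchanged by every group.
    """
    groups = [[] for _ in range(k)]
    for tensor, splittable in zip(inputs, has_target_axis):
        pieces = split_axis0(tensor, k) if splittable else [tensor] * k
        for group, piece in zip(groups, pieces):
            group.append(piece)
    return groups


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # contains the target axis (length 4)
w = [[10, 20], [30, 40]]              # does not contain the target axis
groups = build_input_groups([x, w], [True, False], k=2)
```

  • Here Q = 1 and K = 2: each group holds one half of `x` plus the whole of `w` as shared data.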
  • FIG. 5 is a schematic flowchart of another method for processing computing tasks provided by the embodiment of the present application.
  • FIG. 5 is a specific description of another possible implementation of S303.
  • the operator for executing the calculation task further includes a second operator
  • the second operator includes P splittable axes
  • the P splittable axes are a subset of the N splittable axes.
  • That the P splittable axes are a subset of the N splittable axes means that the P splittable axes appear in the output tensor of the first operator, and the output tensor of the first operator is used as the input tensor of the second operator. That is, the P splittable axes of the second operator also appear among the N splittable axes of the first operator.
  • determine P pieces of splitting reference information, where the p-th piece of the P pieces of splitting reference information includes: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator.
  • the segmentation methods included in the p-th group of candidate segmentation methods are determined according to the p-th segmentation reference information among the P segmentation reference information and the amount M of computing resources.
  • each group of candidate segmentation methods is a candidate segmentation method corresponding to each segmentation reference information, that is, the segmentation reference information corresponding to each slicable axis among the P slicable axes.
  • That each group of candidate splitting methods includes at least one splitting method can also be understood as each group including M-1 splitting methods. For example, when the number of computing resources is 4, the number of target computing resources can be 2, 3, or 4; that is, there are 3 possible numbers of target computing resources, so each group of candidate splitting methods includes 3 splitting methods.
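  • The size of the candidate space can be made concrete with a short sketch. It assumes, as in the example above, that a candidate is fully described by a splittable axis and a target resource count between 2 and M; the function name and axis labels are hypothetical.

```python
def candidate_methods(splittable_axes, m):
    """Build one group of candidates per splittable axis.

    Each candidate is an (axis, k) pair with k = 2 .. M target resources,
    so every group contains M - 1 candidates.
    """
    return {
        axis: [(axis, k) for k in range(2, m + 1)]
        for axis in splittable_axes
    }


# P = 2 splittable axes shared by both operators, M = 4 computing resources.
groups = candidate_methods(["axis_a", "axis_b"], m=4)
total = sum(len(g) for g in groups.values())
```

  • With P = 2 and M = 4 the space holds 2 × 3 = 6 candidates, which is small enough to traverse exhaustively; larger spaces are where a search method becomes useful.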
  • the splitting method that takes the shortest time to complete the computing task among the P groups of candidate splitting methods is determined as the target splitting method.
  • the traversal method can be through simulation, theoretical calculation, or running on actual hardware. The embodiment of the present application does not limit the traversal method.
  • the target splitting method is found by searching the P groups of candidate splitting methods. Multiple search methods are possible, such as a Markov chain Monte Carlo algorithm or a genetic algorithm.
  • the embodiment of the present application does not limit the search method.
  • the target splitting method is determined according to the time required by each splitting method in the P groups of candidate splitting methods to complete the computing task and the number K of target computing resources. For example, if completing the computing task with splitting method 1 and d target computing resources takes the same computing time as completing it with splitting method 2 and e target computing resources, but d is less than e, then splitting method 1 is selected as the target splitting method, and the number of target computing resources corresponding to the target splitting method is d.
  • FIG. 6 is a schematic flowchart of an operator splitting method, provided by an embodiment of the present application, for the case where multiple operators complete the computing task. S505 is described in detail below in conjunction with FIG. 6.
  • According to the target splitting method, determine the target splitting axis, the axis type of the target splitting axis in the first operator, the axis type of the target splitting axis in the second operator, and the Q first input tensors of the first operator that include the target splitting axis.
  • each of the Q first input tensors is split to obtain Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is obtained by splitting the q-th first input tensor.
  • the k-th group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors and the unsplit input tensors of the first operator.
  • The above description takes as an example the case where the computing task is completed by the first operator and the second operator.
  • the graph optimizer also needs to obtain the splitting information of the other operators in order to obtain the candidate splitting methods and then determine the target splitting method.
  • The axis types of an operator's input tensor, the position information of the splittable axes in the input and output tensors of the operator, and the operator splitting methods corresponding to the different axis types in the embodiments of this application are described in detail below in conjunction with Figures 7 to 17.
  • the axis type describes the data dependency between the input tensor and the output tensor of the operator; that is, the graph optimizer can determine the splitting method from the axis type of the input tensor. Therefore, when the inputs of different operators include the same axis type, they can use the same operator splitting method.
  • the axis types of an operator's input tensor may include splittable axes such as element axes, reduce axes, and sliding window axes, and may also include other types of splittable axes; this is not limited in the embodiments of the present application.
  • Figures 7 to 13 are schematic diagrams of the operator splitting methods for the case where a single operator completes the computing task. Operator A, operator B, and operator C in Figures 7 to 13 can each represent the first operator; the embodiments of this application do not limit the name of the first operator.
  • Element (elementwise) axis: if an iteration variable in the input tensor of operator A is an element axis, then the element axis is an axis along which the elements of the input tensor and the output tensor of operator A have a point-to-point mapping relationship; that is, each point in the output tensor and the point of the input tensor that it depends on are at the same position on this axis.
  • For example, the input tensor is a 4D tensor of shape (5,7,9,3), where axis 3 of the input tensor has length 3 and holds the data a0, a1, and a2; the shape of the output tensor is (4,6,8,3), where axis 3 of the output tensor has length 3 and holds the data b0, b1, and b2. If the positions of a0 and b0 correspond, the positions of a1 and b1 correspond, and the positions of a2 and b2 correspond, then the axis type of axis 3 of the input tensor and the output tensor is the element axis.
  • FIG. 7 is a schematic diagram of a splitting method of an element axis provided in an embodiment of the present application.
  • the steps of splitting the input tensor of operator A according to the element axis are shown in Figure 7.
  • Operator A is taken to be an activation function operator as an example for illustration.
  • the embodiment of this application does not limit the type of operator A.
  • the input tensor and output tensor of the activation function operator in Figure 7 are described using a single input tensor and a single output tensor as an example; the embodiments of the present application do not limit the number of input tensors and output tensors.
  • the type of the target splitting axis in the activation function operator is the element axis, and according to the position information of the target splitting axis in the activation function operator, it can be determined that the target splitting axis is an element axis that appears on axis 0 of the first input tensor of the activation function operator; that is, axis 0, of length 8, is the element axis.
  • the forward derivation logic of the element axis is that the lengths of the element axes of the first input tensor and the first output tensor are equal.
  • For example, the first input tensor of the activation function operator is (8,56,56,64) and axis 0 of the first input tensor is the element axis. According to the logic that the length of the element axis of the output tensor equals the length of the element axis of the input tensor, the length of axis 0 of the first output tensor is also 8; that is, the first output tensor is (8,56,56,64).
  • the first output tensor is split along the element axis to obtain the second output tensor of the activation function operator on each target computing resource. Then, from the length of the element axis of the second output tensor on each computing resource, the length of the element axis of each second input tensor is deduced through the reverse shape derivation function of the element axis.
  • The first input tensor is split by calling the split function (split), and the second output tensors on the different target computing resources need to be spliced by the concatenation function (concat) to obtain the first output tensor.
  • The length of axis 0 of the first output tensor is 8, so the length of the element axis of the second output tensor on each computing resource is 4; that is, the 1st second output tensor is (4,56,56,64) and the 2nd second output tensor is (4,56,56,64). The concatenation function is used between the second output tensors and the first output tensor to synchronize the data. There is no intersection between the elements on axis 0 of the 1st second output tensor and the elements on axis 0 of the 2nd second output tensor.
  • the length of the element axis of each second input tensor is deduced in reverse to be 4; that is, the 1st second input tensor is (4,56,56,64) and the 2nd second input tensor is (4,56,56,64). Therefore, to obtain second input tensors of this shape, the graph optimizer splits the first input tensor along the element axis by calling the first slicing function, that is, splits it along axis 0 of the first input tensor to obtain two second input tensors.
  • the lengths of the 0-axis in the two second output tensors shown in (b) of FIG. 7 are equal, and the lengths of the 0-axis in the two second input tensors are also equal.
  • the second input tensors only need to meet the following condition: the elements on axis 0 of the two second input tensors do not overlap; that is, the elements on axis 0 of each second input tensor are a subset of the elements on axis 0 of the first input tensor, and the two subsets have no intersection.
  • the union of the corresponding elements on the 0-axis of the two second input tensors is the corresponding element on the 0-axis of the first input tensor.
  • the embodiment of the present application does not limit whether the lengths corresponding to the 0-axis of the second input tensor obtained on different computing resources are equal.
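  • The element-axis split described for Figure 7 can be checked with a minimal sketch. A ReLU-style function stands in for the activation function operator (an assumption made only for this example), tensors are nested lists, and plain list concatenation plays the role of the concat function. The chunks are deliberately unequal to illustrate that equal lengths are not required.

```python
def relu(rows):
    """Stand-in activation operator: applied independently to every element."""
    return [[max(0.0, v) for v in row] for row in rows]


# First input tensor with an element axis of length 4 on axis 0.
x = [[-1.0, 2.0], [3.0, -4.0], [-5.0, 6.0], [7.0, -8.0]]
full = relu(x)                           # first output tensor (unsplit)

second_inputs = [x[:1], x[1:]]           # disjoint chunks of lengths 1 and 3
second_outputs = [relu(part) for part in second_inputs]
merged = second_outputs[0] + second_outputs[1]   # concat along axis 0
```

  • Because every output element depends only on the input element at the same position, `merged` equals `full` for any disjoint, covering split of axis 0.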
  • Reduce axis: if an iteration variable in the input tensor of operator B is a reduce axis, then the reduce axis is an axis that exists in the input tensor of the operator but either does not exist in the output tensor of the operator or has length 1 there.
  • the reduction axis can be further divided into two types.
  • the first type of reduction axis is the reduction axis on which the operator B performs a reduction operation on the elements in the input tensor.
  • For example, the shape of the input tensor of operator B is (2,3,4,5), where axis 0 of the input tensor is the reduce axis and has length 2; after operator B operates on the input tensor, the shape of the output tensor is (,3,4,5) or (1,3,4,5).
  • the second type of reduce axis is a reduce axis on which operator B does not perform a reduction operation on the elements of the input tensor. Although operator B performs no reduction on the elements along such an axis, the axis still appears in the input tensor but not in the output tensor.
  • An example is the reduce-gather axis, which is described in detail below in conjunction with Figure 11.
  • the first type of reduce axis may include a reduce sum (reduceSum) axis, a reduce maximum (reduceMax) axis, a reduce minimum (reduceMin) axis, a reduce mean (reduceMean) axis, and the like. These different reduce axes all share the general characteristics of a reduce axis. They differ in the type of function that must be called, after the split first input tensor has passed through operator B on the different target computing resources, to obtain a first output tensor equivalent to the one obtained without splitting. The specific splitting methods for the different first-type reduce axes are described in detail below in conjunction with FIG. 8 to FIG. 10.
  • FIG. 8 is a schematic diagram of a splitting method of a reduced sum axis provided by an embodiment of the present application.
  • the steps of splitting the reduced sum axis in the first input tensor of operator B are shown in FIG. 8 .
  • Operator B is taken to be a reduce sum (reduceSum) operator as an example for illustration.
  • the embodiment of this application does not limit the type of operator B.
  • the input tensor and output tensor of the reduce sum operator in Figure 8 are described using a single input tensor and a single output tensor as an example; the embodiments of the present application do not limit the number of input tensors and output tensors.
  • the type of the target splitting axis in the reduce sum operator is the reduce sum axis, and according to the position information of the target splitting axis in the reduce sum operator, it can be determined that the target splitting axis is a reduce sum axis that appears on axis 0 of the first input tensor of the reduce sum operator.
  • For example, the first input tensor of the reduce sum operator is (8,56,56,64) and the axis type of axis 0 of the first input tensor is the reduce sum axis; that is, axis 0, of length 8, is the reduce sum axis. Therefore, according to the characteristics of the reduce sum axis, the length of the reduce sum axis of the first output tensor is 1; that is, the first output tensor is (,56,56,64).
  • the first input tensor is split along the reduce sum axis into two second input tensors, and the two second input tensors are sent to two target computing resources for computation to obtain two second output tensors. The 1st second output tensor is (,56,56,64) and the 2nd second output tensor is (,56,56,64); the first output tensor is obtained by calling the add (AddN) function to synchronize the data on the two target computing resources.
  • the length of the reduce sum axis of the first output tensor is 1, and the second output tensors of the operator on the computing resources are added to obtain the first output tensor; the second output tensors have the same shape as the first output tensor, namely (,56,56,64).
  • the reduce sum axis of the first input tensor is split by the split operator to obtain the second input tensors, where the length of the reduce sum axis of each second input tensor is 4; that is, the shape of each second input tensor is (4,56,56,64).
  • the lengths of the 0-axis in the two second output tensors shown in (b) of FIG. 8 are equal, and the lengths of the 0-axis in the two second input tensors are also equal.
  • the second input tensors only need to meet the following condition: the elements on axis 0 of the two second input tensors do not overlap; that is, the elements on axis 0 of each second input tensor are a subset of the elements on axis 0 of the first input tensor, and the two subsets have no intersection.
  • the union of the corresponding elements on the 0-axis of the two second input tensors is the corresponding element on the 0-axis of the first input tensor.
  • the embodiment of the present application does not limit whether the lengths corresponding to the 0-axis of the second input tensor obtained on different computing resources are equal.
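  • The reduce sum split of Figure 8 can be illustrated with a small sketch on nested lists. The helper names `reduce_sum_axis0` and `add_n` are hypothetical stand-ins for the reduce sum operator and the AddN synchronization function, and the reduce axis is assumed to be axis 0 of a 2D tensor.

```python
def reduce_sum_axis0(rows):
    """Reduce-sum over axis 0: the reduced axis disappears from the output."""
    return [sum(col) for col in zip(*rows)]


def add_n(tensors):
    """Element-wise sum of same-shaped tensors (the synchronization step)."""
    return [sum(vals) for vals in zip(*tensors)]


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # shape (4, 3)
full = reduce_sum_axis0(x)                             # first output tensor

# Two target computing resources, each holding half of the reduce axis.
second_outputs = [reduce_sum_axis0(x[:2]), reduce_sum_axis0(x[2:])]
merged = add_n(second_outputs)
```

  • Each partial result already has the shape of the final output, so the synchronization step only needs an element-wise addition.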
  • FIG. 9 is a schematic diagram of a splitting method of a reduced maximum value axis provided by an embodiment of the present application. The steps of dividing the reduced maximum value axis of the first input tensor of operator B are shown in FIG. 9 .
  • Operator B is taken to be a reduce maximum (reduceMax) operator as an example for illustration.
  • the embodiment of this application does not limit the type of operator B.
  • the input tensor and output tensor of the reduce maximum operator in Figure 9 are described using a single input tensor and a single output tensor as an example; the embodiments of the present application do not limit the number of input tensors and output tensors.
  • The main steps are generally the same as those for splitting the reduce sum axis of the first input tensor of the reduce sum operator in Figure 8.
  • The difference lies in the type of function that must be called after operator B has processed the split tensors. When the reduce sum axis of the first input tensor is split, the second output tensors obtained by applying the reduce sum operator to the second input tensors on the different target computing resources are synchronized by calling the add function to obtain the first output tensor. When the reduce maximum axis of the first input tensor is split, the second output tensors obtained by applying the reduce maximum operator to the second input tensors on the different target computing resources are synchronized by calling the maximum function to obtain the first output tensor.
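  • For comparison with the reduce sum case, the following sketch shows the reduce maximum split, where only the synchronization function changes. The helper names are hypothetical, and axis 0 of a 2D nested list is assumed to be the reduce maximum axis.

```python
def reduce_max_axis0(rows):
    """Reduce-max over axis 0 of a 2D nested list."""
    return [max(col) for col in zip(*rows)]


def elementwise_max(tensors):
    """Element-wise maximum of same-shaped tensors (the synchronization step)."""
    return [max(vals) for vals in zip(*tensors)]


x = [[1, 9, 3], [4, 5, 6], [7, 2, 0], [3, 8, 12]]     # shape (4, 3)
full = reduce_max_axis0(x)                             # first output tensor

second_outputs = [reduce_max_axis0(x[:2]), reduce_max_axis0(x[2:])]
merged = elementwise_max(second_outputs)
```

  • Using an addition here instead of the maximum would give a wrong result, which is why the synchronization function has to be tied to the type of reduce axis.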
  • Fig. 10 is a schematic diagram of a splitting method of a reduce mean axis provided by an embodiment of the present application.
  • the steps of splitting the input tensor of operator B according to the reduced mean axis are shown in Figure 10.
  • Operator B is taken to be a reduce mean (reduceMean) operator as an example for illustration.
  • the embodiment of this application does not limit the type of operator B.
  • the input tensor and output tensor of the reduce mean operator in Figure 10 are described using a single input tensor and a single output tensor as an example; the embodiments of the present application do not limit the number of input tensors and output tensors.
  • the addition function is a synchronization node that sums, along the reduce mean axis, the second output tensors of the different computing resources, and the multiplication function multiplies the summed and synchronized intermediate output tensor by 1/group to obtain the first output tensor, where group is the number of target computing resources; for example, in Figure 10 there are 2 target computing resources, so group is 2.
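  • The add-then-scale synchronization for the reduce mean axis can be sketched as follows. The helper name is hypothetical, axis 0 of a 2D nested list is assumed to be the reduce mean axis, and the split is assumed to be into equal chunks.

```python
def reduce_mean_axis0(rows):
    """Reduce-mean over axis 0 of a 2D nested list."""
    return [sum(col) / len(rows) for col in zip(*rows)]


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]   # shape (4, 2)
full = reduce_mean_axis0(x)                             # first output tensor

group = 2                                   # number of target resources
chunk = len(x) // group                     # equal-length chunks (assumed)
second_outputs = [
    reduce_mean_axis0(x[i * chunk:(i + 1) * chunk]) for i in range(group)
]
summed = [sum(vals) for vals in zip(*second_outputs)]   # add function
merged = [v * (1.0 / group) for v in summed]            # multiply by 1/group
```

  • Note that scaling by 1/group reproduces the unsplit mean only when every chunk has the same length along the reduce mean axis; with unequal chunks the partial means would need to be weighted by chunk length.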
  • the second type of reduction axis includes the reduce-gather axis.
  • According to the addresses indicated by the elements of its index input tensor, operator B looks up data among the elements of its data input tensor. That is, when the first input tensor contains a reduce-gather axis, the corresponding data must be found along the reduce-gather axis of the first input tensor according to the addresses in the first index input tensor (indices) and used as the data on axis 0 of the first output tensor.
  • Fig. 11 is a schematic diagram of a splitting method of a reduce-gather axis provided by an embodiment of the present application.
  • Operator B is taken to be a gather (gather2) operator as an example for detailed description.
  • the embodiments of the present application do not limit the type of operator B. It should be noted that, in Figure 11, the input tensors of the gather operator are one index input tensor and one first input tensor, and the output tensor of the gather operator is one first output tensor, as an example; the embodiments of the present application do not limit the number of first input tensors and first output tensors of the operator.
  • the gather operator has two input tensors, namely the first input tensor and the first index input tensor. The first input tensor is the data input tensor, with shape (80,64), and the first index input tensor is the input tensor holding the index addresses, with shape (20,).
  • According to the splitting information of the gather operator, the target splitting axis is determined to be the reduce-gather axis, and the reduce-gather axis appears on axis 0 of the first input tensor.
  • the first output tensor is 0 axis
  • the first input tensor is split along the reduce-gather axis by calling the third slicing function to obtain two second input tensors.
  • Each target computing resource has a corresponding second input tensor and an index input tensor obtained by offsetting the first index input tensor through a call to the offset (bias) function. Each target computing resource then applies the gather operator to obtain its own second output tensor.
  • the second output tensors on the different target computing resources are added for data synchronization to obtain the first output tensor.
  • the first output tensor has no reduce-gather axis, and the length of axis 0 of the first output tensor equals the length of axis 0 of the first index input tensor. Since there are two computing resources, the second input tensors are obtained by calling the third slicing function to split the reduce-gather axis of the first input tensor, where the length of the reduce-gather axis of each second input tensor is 40; that is, the shape of the 1st second input tensor is (40,64), and the shape of the 2nd second input tensor is also (40,64).
  • When each computing resource performs the gather operator operation, it obtains the same first index input tensor. Since the gather operator on each computing resource holds only half of the data on the reduce-gather axis of the first input tensor, namely the 1st second input tensor or the 2nd second input tensor, the first index input tensor must pass through the offset operator to ensure the correctness of the second output tensor obtained after the gather operator operation on each computing resource.
  • The gather operator on each computing resource looks up, in its second input tensor, the addresses given by the offset first index input tensor. If an address falls within the second input tensor, the corresponding data is the search result; otherwise, 0 is used as the search result. Finally, the second output tensors of the gather operator operations on the two computing resources are added by the addition operator to obtain the first output tensor.
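  • The offset-and-add scheme for the reduce-gather axis can be illustrated with the sketch below. It rests on assumptions: the data tensor is a 2D nested list split into two contiguous halves along axis 0, an index that falls outside a half yields a row of zeros, and the helper name `gather_with_offset` is hypothetical.

```python
def gather_with_offset(table_chunk, indices, offset):
    """Gather rows by global index from one chunk of the data tensor.

    offset is the global position of the chunk's first row. Indices that
    point outside this chunk contribute an all-zero row.
    """
    width = len(table_chunk[0])
    out = []
    for idx in indices:
        local = idx - offset                 # offset (bias) step
        if 0 <= local < len(table_chunk):
            out.append(list(table_chunk[local]))
        else:
            out.append([0] * width)
    return out


table = [[i, 10 * i] for i in range(8)]      # data input tensor, shape (8, 2)
indices = [5, 0, 7, 2]                       # index input tensor, shape (4,)
full = [table[i] for i in indices]           # unsplit gather result

half = len(table) // 2
parts = [(table[:half], 0), (table[half:], half)]
second_outputs = [gather_with_offset(c, indices, off) for c, off in parts]
merged = [
    [a + b for a, b in zip(row0, row1)]
    for row0, row1 in zip(*second_outputs)
]
```

  • Every index hits exactly one chunk, so for each output row one partial result holds the data and the other holds zeros; adding them reconstructs the unsplit output.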
  • Since the type of the reduce axis already determines the specific splitting method, the graph optimizer does not need to rely on the principle of any specific operator when performing graph optimization, and can reasonably split the input tensor of an operator that includes a reduce axis.
  • the characteristic of the reduce axis is that it does not appear in the output tensor, or its length in the output tensor is 1. Therefore, the traditional operator splitting method cannot split an axis of the input tensor that has the characteristic of a reduce axis.
  • Sliding window axis: if an iteration variable in the input tensor of operator C is a sliding window axis, then the sliding window axis is the axis along which operator C performs a sliding window scanning operation on the elements of its input tensor. If the sliding window is larger than the stride, the windows of every two adjacent scans overlap.
  • When the first output tensor is split along the sliding window axis and there are two target computing resources, the elements on the sliding window axis of the first output tensor are divided equally. Some data on the sliding window axis of the two resulting halves of the first output tensor then depend on the same data on the sliding window axis of the first input tensor. Therefore, there are two methods for splitting a first input tensor that includes a sliding window axis, which are described in detail in conjunction with FIG. 12 and FIG. 13.
  • The forward shape derivation function for the sliding window axes of the first input tensor and the first output tensor is $y = f_2(x)$, where $x$ represents the length of the sliding window axis of the first input tensor and $y$ represents the length of the sliding window axis of the first output tensor.
  • $f_2()$ is related to the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
  • The reverse shape derivation function for the sliding window axes of the first input tensor and the first output tensor is $x = f_2^{-1}(y)$. That is, reverse derivation is performed from the length of the sliding window axis in the first output tensor, and an appropriate splitting method is determined to obtain the second output tensor and the second input tensor of each computing resource.
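  • The application does not spell out the forward and reverse shape derivation functions. The sketch below uses the commonly used convolution output-length formula as an assumed stand-in; the function names, the `pad_total` parameter, and the total padding of 1 are assumptions chosen only so that the sketch reproduces the lengths 56, 28, 14, and 29 used in the Figure 12 example of this description.

```python
def forward_len(x, kernel, stride, pad_total=0, dilation=1):
    """Assumed forward shape derivation: output length for input length x."""
    effective = dilation * (kernel - 1) + 1   # span covered by one window
    return (x + pad_total - effective) // stride + 1


def reverse_len(y, kernel, stride, dilation=1):
    """Assumed reverse shape derivation: input length needed for y outputs."""
    effective = dilation * (kernel - 1) + 1
    return (y - 1) * stride + effective


# Kernel size 3, stride 2: input length 56 gives output length 28.
y_full = forward_len(56, kernel=3, stride=2, pad_total=1)
# Splitting the 28 outputs over two resources gives 14 outputs each,
# and each resource then needs 29 input elements.
x_part = reverse_len(y_full // 2, kernel=3, stride=2)
```

  • Two chunks of 29 elements sum to 58, which is more than the input length, so the chunks must share data; this is the overlap that the splitting method has to account for.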
  • Fig. 12 is a schematic diagram of a sliding window axis splitting method provided by an embodiment of the present application.
  • the method of splitting the input tensor of operator C with overlap according to the sliding window axis is shown in Figure 12.
  • Operator C is taken to be a convolution operator as an example for illustration.
  • the embodiment of this application does not limit the type of operator C.
  • the input tensor and output tensor of the convolution operator in Figure 12 are described using a single input tensor and a single output tensor as an example; the embodiments of the present application do not limit the number of input tensors and output tensors.
  • the type of the target splitting axis in the convolution operator is a sliding window axis. The target splitting axis is a sliding window axis that appears on axis 1 of the first input tensor of the convolution operator; that is, axis 1, of length 56, is the sliding window axis. Therefore, according to the forward shape derivation function of the sliding window axis and the length of the sliding window axis in the first input tensor of the convolution operator, the length of the sliding window axis in the first output tensor is derived forward.
  • For example, the first input tensor of the operator is (1,56,56,64). According to the forward shape derivation function of the sliding window axes of the first input tensor and the first output tensor, with a convolution stride of 2 and a convolution kernel size of 3, the first output tensor is obtained as (1,28,56,64).
  • the first output tensor is split along the sliding window axis to obtain K second output tensors. Then, through the reverse shape derivation function of the sliding window axis, the length of the sliding window axis of the second input tensor of the convolution operator on each target computing resource is derived in reverse. According to the length of the sliding window axis in each second output tensor, the first input tensor can be sliced along the sliding window axis by calling the first slicing function to obtain the second input tensors. After the second output tensors have been computed on the different target computing resources, a first output tensor equivalent to the result of the operation before splitting can be obtained by calling the concatenation function.
  • Axis 1 of the first output tensor is the sliding window axis and has length 28, so the length of axis 1 of the second output tensor on each computing resource is 14. Then, according to the reverse shape derivation logic of the sliding window axis, since the convolution stride is 2 and the convolution kernel size is 3, the length of the sliding window axis of the second input tensor on each computing resource is 29.
  • the first input tensor is sliced along the 1-axis to obtain two second input tensors with a 1-axis length of 29. The two second input tensors contain overlapping data: the 1-axis data of one second input tensor ranges from 0 to 28 in the 1-axis of the first input tensor, and the 1-axis data of the other second input tensor ranges from 28 to 56 in the 1-axis of the first input tensor.
  • the 29th data in axis 1 of the first input tensor is the overlapping part of the two second input tensors.
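  • The numbers in this example can be reproduced with a minimal sketch. The function names are hypothetical, and the forward rule assumes "same"-style padding (output length = ceil(input length / step size)), which is one convention that yields 28 from 56; the text above does not state the padding rule. Under that assumption, index 56 in the second range lies one position past the 56 real elements and would be a padded position.

```python
import math


def forward_len(in_len, stride):
    # Forward shape derivation, ASSUMING "same"-style padding.
    return math.ceil(in_len / stride)


def reverse_len(out_len, stride, kernel):
    # Reverse shape derivation: input positions needed for out_len outputs.
    return (out_len - 1) * stride + kernel


def overlapping_ranges(out_len, k, stride, kernel):
    # Split the output axis into k equal parts and derive, for each part,
    # the inclusive (start, end) range it needs on the input axis.
    per_shard = out_len // k  # assumes out_len is divisible by k
    ranges = []
    for i in range(k):
        start = i * per_shard * stride
        end = start + reverse_len(per_shard, stride, kernel) - 1
        ranges.append((start, end))
    return ranges


out_len = forward_len(56, stride=2)
shard_in_len = reverse_len(out_len // 2, stride=2, kernel=3)
ranges = overlapping_ranges(out_len, k=2, stride=2, kernel=3)
print(out_len, shard_in_len, ranges)
```

  • Both ranges contain index 28, which is the 29th position on the axis, matching the overlapping part described above.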
  • the overlapping segmentation method shown in Figure 12 is suitable for scenarios in which the split input tensors undergo operator operations on different computing resources without frequent data synchronization; the computations are completely independent and can run in parallel. However, in some scenarios the results computed from the split input tensors on different computing resources must be synchronized frequently, so the overlapping parts of the output tensors obtained on different computing resources are spliced repeatedly and the overlapping portion keeps growing, causing unnecessary repeated computation. Therefore, the embodiment of this application also provides another splitting method in which the sliding window axis is split without overlap, as shown in FIG. 13.
  • Fig. 13 is a schematic diagram of another sliding window axis splitting method provided by the embodiment of the present application. The steps of splitting the sliding window axes of the input tensor of operator C without overlap are shown in Figure 13.
  • operator C is used as an example for illustration.
  • the input tensor and output tensor of the convolution operator in Figure 13 are described using a single input tensor and a single output tensor as an example.
  • the number of input tensors and output tensors is not limited.
  • the step of deriving the length of the sliding window axis in the second input tensor of each target computing resource in FIG. 13 is the same as in the overlapping segmentation of FIG. 12.
  • the difference from FIG. 12 lies in the process of splitting the first input tensor to obtain the K second input tensors.
  • the second slicing function in (b) of Figure 13 is used to equally divide the first input tensor along the sliding window axis. Second slicing function 1 and second slicing function 2 are used to obtain the overlapping part of the sliding window axis data of the second input tensors, that is, the input data on which the sliding window axis data in the second output tensors computed on different target computing resources jointly depend.
  • first splicing function 1 and first splicing function 2 are used to splice, along the sliding window axis, the third input tensors and the fourth input tensors produced by the slicing functions, so as to obtain the second input tensor of each target computing resource. The second splicing function is used to splice the second output tensors produced by the convolution operator on the different target computing resources to obtain the first output tensor.
  • the first input tensor is split along the 1-axis into two equal third input tensors, each of shape (1,28,56,64).
  • the 1st third input tensor and the 2nd third input tensor are obtained.
  • the first third input tensor is sliced along the 1-axis to obtain the second fourth input tensor, whose shape is (1,1,56,64); the data of the second fourth input tensor on axis 1 is the last data of the first third input tensor on the sliding window axis, that is, the 28th data of the first input tensor on axis 1.
  • the data of the first fourth input tensor on axis 1 is the first data of the second third input tensor on the sliding window axis, that is, the 29th data of the first input tensor on axis 1.
  • the first third input tensor and the first fourth input tensor are spliced according to the axis 1 to obtain the first second input tensor, whose shape is (1,29, 56,64), the data range of this second input tensor 1 axis is from 0 to 28 in the first input tensor 1 axis.
  • the second third input tensor and the second fourth input tensor are spliced according to the 1 axis to obtain the second second input tensor, whose shape is (1 ,29,56,64), the data range of axis 1 in the second second input tensor is from 28 to 56 in axis 1 of the first input tensor.
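  • The equal split followed by a boundary slice and a splice can be sketched for the first second input tensor as follows. This is an illustration only: a plain list of indices stands in for the 56 positions on axis 1, and the helper name is hypothetical. The second shard is left out because its boundary data depends on the padding convention, which is not spelled out here.

```python
def equal_split(data, k):
    # Equal, non-overlapping division along the sliding window axis.
    per_shard = len(data) // k  # assumes len(data) is divisible by k
    return [data[i * per_shard:(i + 1) * per_shard] for i in range(k)]


# Stand-in for the 56 positions on axis 1 of the first input tensor.
axis1 = list(range(56))

# Two equally divided "third input tensors", 28 positions each.
third = equal_split(axis1, 2)

# "First fourth input tensor": the first position of the second half,
# i.e. the 29th position (index 28) of the original axis.
fourth_1 = third[1][:1]

# Splice the first half with that boundary slice to form the first
# "second input tensor", covering positions 0..28 (length 29).
second_1 = third[0] + fourth_1
print(len(second_1), second_1[0], second_1[-1])
```

  • The spliced result covers the same positions as the first overlapping slice of Figure 12, but only the single boundary position is copied between shards.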
  • the non-overlapping splitting of the sliding window axis is suitable for scenarios that require frequent data synchronization between different computing resources, for example multi-chip parallelism, where the splicing function serves as the link between different dies. In this way, overlapping data is neither computed repeatedly nor allowed to keep growing, which effectively reduces the computing pressure and storage pressure on the computing resources.
  • by performing single-operator segmentation according to the types of the different axes in the operator's input tensor and the segmentation method corresponding to each axis type, the graph optimizer can automatically obtain different single-operator segmentation strategies without relying on the principle of a specific operator, thereby completely decoupling the graph optimizer from the operator optimization module.
  • tensor axes are not limited to those listed in the embodiment of the present application, and there may be other tensor axes and corresponding operator segmentation methods, which are not limited in the embodiments of the present application.
  • the computing resource can be GPU, CPU, bare chip or chip, etc.
  • the embodiment of the present application does not limit the type of computing resource, nor does it limit the number of computing resources. The two computing resources used in the embodiment of the present application are just one example.
  • the graph optimizer automatically splits the operator input and output tensors according to different types of axes.
  • it is not necessary to split the input and output tensors based on the principle of specific operators. It only needs to split the input and output tensors based on the operator splitting methods corresponding to different types of axes.
  • the calculation formula of the operator is not changed by splitting the operator's input and output tensors; only some parameters of the operator change. This completely decouples graph optimization from the principle of any specific operator, so the method of splitting the first input tensor of an operator based on different types of axes generalizes better.
  • the above content is a specific description of the splitting method of the input tensor and the output tensor of a single operator.
  • the position information of the splittable axis in the input tensor and output tensor of an operator determines its segmentation method. When multiple operators are required to complete the computing task, the position information of the splittable axes in the input tensors and output tensors of those operators makes it possible for the graph optimizer to cascade different operator segmentation methods into subgraphs.
  • the position information of the splittable axis of the operator in the input tensor and the output tensor of the operator will be described in detail below with reference to FIG. 14 .
  • Fig. 14 is a schematic diagram of position information of an operator's splittable axis in the input tensor and output tensor of the operator provided by the embodiment of the present application.
  • the position information of the operator's splittable axis in the input tensor and output tensor indicates which input tensors and which output tensors the same splittable axis appears on, and the specific position of that axis in each of those input tensors and output tensors.
  • the type of each divisible axis is one of the above-mentioned different types of axes.
  • the embodiment of the present application does not limit the number of input tensors and output tensors of the operator. Multiple input tensors can be operated by the operator to obtain multiple output tensors. There is no limit on the number of first tensor axes in input tensors and output tensors.
  • the 0-axis, 1-axis and 2-axis of the first feature map input tensor are sliding window axes, and its 3-axis is a reduction axis. The 0-axis of the first weight input tensor is an element axis, and its 1-axis, 2-axis and 3-axis are reduction axes. Because a reduction axis does not appear in the output tensor, the tensor axes that appear in the first output tensor are the 0-axis, 1-axis and 2-axis of the first feature map input tensor and the 0-axis of the first weight input tensor. Therefore, the shape of the first output tensor is (8,56,56,4).
  • the data structure centered on the splittable axis includes the type of the splittable axis, the input tensors in which the splittable axis appears, and the position at which the splittable axis occurs in each input tensor and output tensor.
  • one input tensor shape is (3,1,5), and the other input tensor shape is (3,4,1).
  • the resulting output tensor has shape (3,4,5), and a specific data structure centered on the splittable axis can be written for this case.
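  • One possible axis-centered record for this example is sketched below. All class and field names are hypothetical and are not taken from the application; the enum lists only the axis types named in this section, and treating all three axes of this broadcast-style example as element axes is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class AxisType(Enum):
    # Only the axis types named in this section.
    ELEMENT = "element"
    REDUCTION = "reduction"
    SLIDING_WINDOW = "sliding_window"


@dataclass
class SplittableAxis:
    axis_type: AxisType
    length: int
    # tensor index -> axis number at which this splittable axis appears
    input_positions: Dict[int, int] = field(default_factory=dict)
    output_positions: Dict[int, int] = field(default_factory=dict)


# Inputs (3,1,5) and (3,4,1) produce output (3,4,5).
# ASSUMPTION: a dimension of length 1 is not recorded as a position
# of the splittable axis, because there is nothing to split on it.
axes = [
    SplittableAxis(AxisType.ELEMENT, 3, {0: 0, 1: 0}, {0: 0}),
    SplittableAxis(AxisType.ELEMENT, 4, {1: 1}, {0: 1}),
    SplittableAxis(AxisType.ELEMENT, 5, {0: 2}, {0: 2}),
]

# The output shape can be recovered from the records alone.
out_shape = [0, 0, 0]
for axis in axes:
    out_shape[axis.output_positions[0]] = axis.length
print(tuple(out_shape))
```

  • With this layout, a lookup by axis answers directly which tensors must be split together when that axis is chosen as the target segmentation axis.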
  • the data structure centered on the input tensor and output tensor includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the type of axis corresponding to each number:
  • dim_slice_types map<int,AXIS_TYPE>: indicates the type of axis corresponding to each number.
  • one input tensor shape is (3,1,5), and the other input tensor shape is (3,4,1).
  • the resulting output tensor has shape (3,4,5), and a specific data structure centered on the tensor axis can be written for this case.
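  • A tensor-centered counterpart for the same example might look like the sketch below. Only the field name dim_slice_types and the idea of mapping an axis number to an axis type come from the text above; the remaining names, the use of -1 for a length-1 dimension, and the choice of element axes are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class AxisType(Enum):
    ELEMENT = "element"
    REDUCTION = "reduction"
    SLIDING_WINDOW = "sliding_window"


@dataclass
class OperatorSliceInfo:
    # For each input/output tensor: the axis number carried by each dimension.
    # ASSUMPTION: -1 marks a length-1 dimension that carries no numbered axis.
    input_dims: List[List[int]]
    output_dims: List[List[int]]
    # Axis number -> axis type, mirroring dim_slice_types map<int,AXIS_TYPE>.
    dim_slice_types: Dict[int, AxisType]


def input_positions(info, axis_number):
    # Return {input tensor index: dimension} for every input holding the axis.
    return {
        t: dims.index(axis_number)
        for t, dims in enumerate(info.input_dims)
        if axis_number in dims
    }


# Inputs (3,1,5) and (3,4,1) produce output (3,4,5).
info = OperatorSliceInfo(
    input_dims=[[0, -1, 2], [0, 1, -1]],
    output_dims=[[0, 1, 2]],
    dim_slice_types={0: AxisType.ELEMENT, 1: AxisType.ELEMENT, 2: AxisType.ELEMENT},
)
print(input_positions(info, 0), input_positions(info, 1))
```

  • The two layouts hold the same information; the tensor-centered form is convenient when walking a tensor's dimensions, the axis-centered form when choosing an axis to split.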
  • the graph optimizer divides and cascades different operators into subgraphs according to the axis type in the input tensor of the operator and the position information of each axis in the input tensor and output tensor. Different axis position information may have different applications. The specific application of the operator segmentation method for processing computing tasks in the embodiment of the present application will be described in detail below with reference to FIG. 15 to FIG. 17 .
  • Fig. 15 is a schematic diagram of a specific application of operator segmentation provided by the embodiment of this application.
  • Scenario 1: the output tensor of the first operator, which includes the splittable axis, serves as the input tensor of the second operator, and the splitting of the input tensors of multiple consecutive operators is optimized.
  • the axis type of the target split axis in different operators and the position information in the first input tensor and the first output tensor of different operators determine the split of the first input tensor Way.
  • the graph optimizer obtains the segmentation information of the ReLU operator and the segmentation information of the TanH operator; the axis type of the corresponding target segmentation axis in the target segmentation method determined by the graph optimizer is an element axis.
  • the element axis appears on the 0-axis of the first input tensor and the 0-axis of the first output tensor of the ReLU operator.
  • the shape of the first input tensor is (8,56,56,64), so according to the segmentation method corresponding to the element axis described above, the shape of the first output tensor is also (8,56,56,64).
  • the element axis appears on the 0-axis of the first input tensor and the 0-axis of the first output tensor of the TanH operator, and the shape of the first input tensor is (8,56,56,64). Therefore, according to the segmentation method corresponding to the element axis described above, the shape of the first output tensor is also (8,56,56,64).
  • because the graph optimizer knows the position information of the target splitting axis in the first input tensor and the first output tensor of each operator, when the ReLU operator and the TanH operator run consecutively, the intermediate splicing operator and the intermediate segmentation operator that would otherwise be generated by splitting the tensors between the two operators can be omitted.
  • the element axis appears on the 0-axis of the input and output tensors of both the ReLU operator and the TanH operator, so only one segmentation operator node and one splicing operator node are needed to realize the consecutive operation of the ReLU operator and the TanH operator.
  • the first input tensor is segmented according to the element axis to obtain the second input tensor of two equally divided ReLU operators.
  • the second output tensor of the ReLU operator is obtained through the operation of the ReLU operator on each computing resource, and the second output tensor of the ReLU operator is used as the second input tensor of the TanH operator.
  • the TanH operator is operated to obtain the second output tensor of the TanH operator, and finally a splicing operator operation is performed to obtain the final first output tensor.
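  • The flow in this scenario, one split, two chained operators per shard, one splice, can be sketched in plain Python. Nested lists stand in for tensors and a list comprehension stands in for the per-resource execution; the helper names are hypothetical.

```python
import math


def relu(rows):
    return [[max(0.0, v) for v in row] for row in rows]


def tanh(rows):
    return [[math.tanh(v) for v in row] for row in rows]


def split_axis0(rows, k):
    per_shard = len(rows) // k  # assumes len(rows) is divisible by k
    return [rows[i * per_shard:(i + 1) * per_shard] for i in range(k)]


def concat_axis0(parts):
    return [row for part in parts for row in part]


# Four samples on axis 0 (the element axis), two values each.
batch = [[-1.0, 0.5], [2.0, -3.0], [0.0, 1.5], [-0.25, 4.0]]

# Unsplit reference: TanH applied to the ReLU output.
reference = tanh(relu(batch))

# Split once, run ReLU then TanH on each shard with no splice or
# re-split in between, then splice once at the end.
shards = split_axis0(batch, 2)
shard_outputs = [tanh(relu(shard)) for shard in shards]
result = concat_axis0(shard_outputs)
print(result == reference)
```

  • Because both operators keep the element axis at position 0, the shard produced by ReLU is already a valid input shard for TanH.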
  • the axis type of the target segmentation axis in the continuous operator is not limited.
  • here, the target segmentation axes in the first operator and the second operator have the same axis type, and element axes are used for illustration.
  • the axis types of the target split axis in the continuous operator may be the same or different, which is not limited in this embodiment of the present application.
  • continuous operator operations are performed on the divided input tensor on the same target computing resource, so that parallel computing of multiple target computing resources can be realized.
  • Fig. 16 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of this application.
  • the splittable axis appears on multiple input tensors and a single output tensor of a single operator.
  • the segmentation mode in the input tensor of the first operator is determined.
  • the addition operator has two first input tensors.
  • the shape of the first of these, first input tensor x, is (m, n), and the shape of the second, first input tensor y, is (m,).
  • the type of splittable axis 1 is an element axis; splittable axis 1 appears on the 0-axis of the first input tensor x and the 0-axis of the first input tensor y, with length m. The type of splittable axis 2 is also an element axis; splittable axis 2 appears on the 1-axis of the first input tensor x, with length n.
  • the first is to split the input tensors that include splittable axis 1 of length m
  • the second is to split the input tensor that includes splittable axis 2 of length n.
  • splitting is performed on an input tensor including a splittable axis 1 of length m.
  • the first input tensor x can be equally divided along the 0-axis into two second input tensors x0 and x1, and the first input tensor y can be equally divided along the 0-axis into two second input tensors y0 and y1. These are sent to two target computing resources for the addition operator operation to obtain the second output tensors, and the first output tensor is then obtained by calling the splicing function.
  • splitting is performed on an input tensor including a splittable axis 2 of length n.
  • the first input tensor x can be equally divided into two second input tensors x0' and x1' according to the 1-axis.
  • the first input tensor y is sent as shared data to different target computing resources. Subsequently, the addition operator operation is performed on each target computing resource to obtain the second output tensor, and then the first output tensor is obtained by calling the splicing function.
  • each target computing resource can obtain the first input tensor y by addressing, or the first input tensor y can be copied to each target computing resource.
  • the manner in which the first input tensor y is shared is not limited.
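  • Both splitting options for the addition operator can be checked on a small case. This is a sketch with lists standing in for tensors: x has shape (m, n) = (2, 4), y has shape (m,) = (2,), and y[i] is added to every element of row i, which is how the axis positions above line up. The function name add is hypothetical.

```python
def add(x, y):
    # out[i][j] = x[i][j] + y[i]; y is aligned with axis 0 of x.
    return [[v + y[i] for v in row] for i, row in enumerate(x)]


x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
y = [10, 20]
reference = add(x, y)

# Option 1: split along splittable axis 1 (axis 0 of both x and y).
x0, x1 = x[:1], x[1:]
y0, y1 = y[:1], y[1:]
option1 = add(x0, y0) + add(x1, y1)  # splice along axis 0

# Option 2: split along splittable axis 2 (axis 1 of x only);
# y is shared unchanged by both shards.
xa = [row[:2] for row in x]
xb = [row[2:] for row in x]
part_a, part_b = add(xa, y), add(xb, y)
option2 = [ra + rb for ra, rb in zip(part_a, part_b)]  # splice along axis 1

print(option1 == reference, option2 == reference)
```

  • Option 1 moves less data per resource, while option 2 needs the whole of y on every resource, which is why the sharing mechanism matters there.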
  • the appropriate operator segmentation method can be selected according to the axis type of the splittable axis included in the segmentation information of the operator and the position information of the splittable axis in the input tensor and output tensor of the operator.
  • Fig. 17 is a schematic diagram of another specific application of operator segmentation provided by the embodiment of the present application. Scenario 3, the positions of splittable axis 1 in the first input tensor and the first output tensor of the first operator are different.
  • the graph optimizer obtains the segmentation information of the transformation operator and, according to it, determines that splittable axis 1 is an element axis and determines the positions of splittable axis 1 in the first input tensor and the first output tensor. As shown in (a) of Figure 17, splittable axis 1 is the 0-axis of the first input tensor and the 1-axis of the first output tensor.
  • the shape (56, 8, 56, 64) of the first output tensor can be derived from the shape (8, 56, 56, 64) of the first input tensor based on the forward shape inference function of the element axes.
  • the first output tensor is divided along the 1-axis to obtain two second output tensors with a 1-axis length of 4. Then, according to the reverse shape derivation function of the element axis, the 0-axis length of the two second input tensors is determined to be 4, and the segmentation function is called to split the first input tensor, whose 0-axis length is 8, along the 0-axis to obtain two second input tensors.
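  • The effect of an axis changing position can be sketched at two levels: the shard shapes for the shapes quoted above, and a small two-dimensional transpose as a stand-in for the transformation operator (an assumption; the application does not tie the operator to a plain transpose). Helper names are hypothetical.

```python
def shard_shape(shape, axis_pos, k):
    # Shape of one shard when the axis at axis_pos is divided into k parts.
    dims = list(shape)
    dims[axis_pos] //= k
    return tuple(dims)


# Splittable axis 1 sits at position 0 in the input and position 1 in the output.
in_shard = shard_shape((8, 56, 56, 64), axis_pos=0, k=2)
out_shard = shard_shape((56, 8, 56, 64), axis_pos=1, k=2)


def transpose(m):
    return [list(col) for col in zip(*m)]


# Small stand-in: a (4, 3) input whose axis 0 becomes axis 1 of the output.
x = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
reference = transpose(x)

# Split the input along axis 0, transform each shard, then splice the
# shard outputs along axis 1, the position of the same axis in the output.
shards = [x[:2], x[2:]]
parts = [transpose(s) for s in shards]
result = [left + right for left, right in zip(*parts)]
print(in_shard, out_shard, result == reference)
```

  • The split happens on the input position of the axis and the splice on its output position; knowing both positions is all that is required.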
  • in this way, the graph optimizer only needs to know the axis type of the splittable axis of the operator's input tensor and the position information of the splittable axis in the input tensor and output tensor. Without relying on the principle of a specific operator, it can properly split the operator's input and output tensors, which completely decouples operator optimization from graph optimization.
  • the foregoing content is a description of the method for processing computing tasks in the embodiment of the present application.
  • the device for processing computing tasks in the embodiment of the present application will be described below in conjunction with FIG. 19 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
  • Fig. 19 is a schematic diagram of an apparatus for processing computing tasks provided by an embodiment of the present application.
  • the device 1900 is applied to a graph optimizer, and the device includes: a processor 1901 and a transmission interface 1902 .
  • the device may further include a memory 1903 and a bus 1904 .
  • the memory 1903 , the processor 1901 , and the transmission interface 1902 realize communication connection with each other through the bus 1904 .
  • the memory 1903 can be a ROM, a static storage device, or a RAM.
  • the memory 1903 may store a program. When the program stored in the memory 1903 is executed by the processor 1901, the processor 1901 and the transmission interface 1902 are used to execute the steps of the method for processing a computing task in the embodiment of the present application.
  • the processor 1901 is configured to determine a first operator for performing a computing task, where the first operator includes N divisible axes, and N is a positive integer greater than or equal to 1;
  • the processor 1901 is configured to segment the input tensor of the first operator according to the segmentation information of the first operator, and determine K groups of input tensors, where K is a positive integer greater than or equal to 2.
  • the transmission interface 1902 is used to send K sets of input tensors to K target computing resources respectively, so that the K target computing resources can complete computing tasks.
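  • The division of work described for the processor and the transmission interface can be illustrated with a small sketch. Everything here is a stand-in: a dictionary plays the role of the segmentation information, worker threads play the role of the K target computing resources, and only an axis at position 0 is handled. None of these names come from the application.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical segmentation information: operator name -> splittable axis position.
SEGMENTATION_INFO = {"relu": {"axis": 0}}


def relu(rows):
    return [[max(0.0, v) for v in row] for row in rows]


def process_task(op_name, op_fn, tensor, k):
    # 1. Look up the segmentation information of the operator.
    axis = SEGMENTATION_INFO[op_name]["axis"]
    if axis != 0:
        raise NotImplementedError("this sketch only splits along axis 0")

    # 2. Split the input tensor into K groups along the splittable axis.
    per_group = len(tensor) // k  # assumes len(tensor) is divisible by k
    groups = [tensor[i * per_group:(i + 1) * per_group] for i in range(k)]

    # 3. Send the K groups to K workers; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=k) as pool:
        outputs = list(pool.map(op_fn, groups))

    # 4. Splice the per-worker outputs back together along axis 0.
    return [row for part in outputs for row in part]


data = [[-1.0, 2.0], [3.0, -4.0], [-5.0, 6.0], [7.0, -8.0]]
result = process_task("relu", relu, data, k=2)
print(result == relu(data))
```

  • The dispatcher never inspects what the operator computes; it relies only on the recorded axis position, which is the decoupling the embodiments aim for.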
  • the processor 1901 can be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits for executing related programs, so as to realize the functions required by the units in the device for processing computing tasks in the embodiment of the present application, or to execute the method for processing computing tasks in the method embodiment of the present application.
  • the processor 1901 may also be an integrated circuit chip, which has a signal processing capability. During implementation, each step of the method for processing a computing task in the embodiment of the present application may be completed by an integrated logic circuit of hardware in the processor 1901 or instructions in the form of software.
  • the aforementioned processor 1901 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a mature storage medium in the field such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, register.
  • the storage medium is located in the memory 1903, and the processor 1901 reads the information in the memory 1903, and combines its hardware to complete the functions required by the units included in the processing computing task device of the embodiment of the application, or execute the processing of the method embodiment of the application Calculation task method.
  • the transmission interface 1902 implements communication between the apparatus 1900 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the image to be processed can be obtained through the transmission interface 1902 .
  • the bus 1904 may include a path for transferring information between various components of the device 1900 (eg, the memory 1903, the processor 1901, the transmission interface 1902).
  • the apparatus 1900 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 1900 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the device 1900 may also only include the devices necessary to realize the embodiment of the present application, and does not necessarily include all the devices shown in FIG. 19 .
  • the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory can be read-only memory (read-only memory, ROM), programmable read-only memory (programmable ROM, PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically erasable programmable read-only memory (electrically EPROM, EEPROM) or flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory. Many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), serial link DRAM (SLDRAM), and direct memory bus random access memory (direct rambus RAM, DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
  • the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless (such as infrared, wireless, microwave, etc.) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • An embodiment of the present application provides a computer-readable storage medium, which is used to store a computer program, and when the computer program is run on a computer, the computer executes the method for processing computing tasks as in the foregoing method embodiments.
  • An embodiment of the present application provides a computer program product, and the computer program product includes: computer program code, when the computer program code is executed, implements the method for processing computing tasks as in the foregoing method embodiments.
  • "At least one" means one or more, and "multiple" means two or more.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • at least one item (unit) in a, b or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c can be single or multiple.
  • sequence numbers of the above-mentioned processes do not mean the order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disc and other media that can store program code.

Abstract

Embodiments of the present application provide a method and device for processing a computing task. The method comprises: determining a first operator used for executing a computing task, the first operator comprising N segmentable axes, and N being a positive integer greater than or equal to 1; obtaining segmentation information of the first operator from an operator segmentation information base; segmenting an input tensor of the first operator according to the segmentation information of the first operator, so as to obtain K groups of input tensors; and respectively sending the K groups of input tensors to K target computing resources, such that the K target computing resources complete the computing task. Therefore, a graph optimizer can perform automatic segmentation of input and output tensors of an operator without the principle of a specific operator, and then complete decoupling of the graph optimizer and an operator optimization module is achieved, so that the operator corresponding to the computing task is computed in parallel on the plurality of computing resources.

Description

Method and device for processing computing tasks
Technical field
The embodiments of the present application relate to the field of artificial intelligence, and more specifically, to a method and device for processing computing tasks.
Background
Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence gives machines the capabilities of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
Open-source deep learning frameworks such as TensorFlow, PyTorch, and MXNet provide users with a friendly programming environment for deep learning models, so that users can easily deploy a designed deep learning model on general-purpose computer hardware platforms such as a central processing unit (CPU) or a graphics processing unit (GPU). If a designed deep learning model is deployed on a specific device, the inference framework of that device's manufacturer is generally used, for example, TensorRT on NVIDIA GPUs. If a designed deep learning model needs to run on multiple different types of devices, a deep learning compiler can be used to generate, from the model described in the deep learning framework, code that runs effectively on each type of device.
Deep learning compilers usually improve the runtime performance of models on different hardware through graph optimization and operator optimization. These two kinds of optimization are usually relatively decoupled and independent of each other. However, graph optimization often has to rely on the principle of each operator to obtain a suitable parallel strategy for operator optimization. How a graph optimizer can segment operators automatically without relying on the principle of any specific operator is therefore an urgent problem to be solved.
Summary
The embodiments of the present application provide a method and device for processing computing tasks, so that a graph optimizer can automatically segment the input and output tensors of an operator without relying on the principle of any specific operator. This fully decouples the graph optimizer from the operator optimization module and allows the operators corresponding to a computing task to be computed in parallel on multiple computing resources.
In a first aspect, a method for processing a computing task is provided. The method is performed by a graph optimizer and includes: determining a first operator used to perform the computing task, where the first operator includes N segmentable axes and N is a positive integer greater than or equal to 1; obtaining segmentation information of the first operator from an operator segmentation information base, where the segmentation information of the first operator includes the axis type, in the first operator, of the n-th segmentable axis among the N segmentable axes and first position information, the first position information indicates the position of the n-th segmentable axis in the input tensor of the first operator, and n = 1, …, N; segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2; and sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be understood that the statement that the first operator includes N segmentable axes means that the input tensor of the first operator includes N segmentable axes.
It should also be understood that the computing task may be a computing task in the field of artificial intelligence, for example, an image processing task, a video processing task, a speech processing task, or a natural language processing task; a computing task in the field of big data processing; or a computing task in the field of high-performance computing (HPC). This application does not limit this. Correspondingly, the input tensor of the first operator corresponding to the computing task may be the input tensor of a computing task in any of the above fields. For example, when the computing task is an image processing task, the input tensor of the first operator represents image-related data.
At present, the input tensor of an operator is segmented by an algorithm engineer, who writes scripts at the application layer and chooses the segmentation mode according to the segmentation axes of a particular operator type, so the input tensor of an operator cannot be segmented automatically. In the embodiments of the present application, the graph optimizer obtains the segmentation information of each operator directly from the operator segmentation information base. The graph optimizer therefore does not need to be aware of the mathematical semantics or the underlying implementation of any operator in order to segment its input tensor automatically, which fully decouples graph optimization from operator optimization and allows the operators corresponding to the computing task to be computed in parallel on multiple computing resources.
In a possible implementation, the axis type of a segmentable axis is one of the following: element axis, reduction axis, or sliding window axis. An axis along which the elements of the operator's input tensor and output tensor have a point-to-point mapping is an element axis; if a first axis is present in the operator's input tensor but absent from the operator's output tensor, the first axis is a reduction axis; an axis along which the operator performs a sliding window scan over the elements of its input tensor is a sliding window axis.
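The segmentation information described above can be pictured as a small lookup table. The following is a minimal sketch, assuming a hypothetical in-memory information base: the names `AxisType`, `SegmentableAxis`, `SEGMENTATION_INFO_BASE`, and `lookup_segmentation_info`, the operator names, and the example entries are illustrative and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"                # point-to-point mapping between input and output
    REDUCTION = "reduction"            # present in the input, absent from the output
    SLIDING_WINDOW = "sliding_window"  # scanned by a sliding window


@dataclass
class SegmentableAxis:
    name: str
    axis_type: AxisType
    # Position information: input tensor index -> dimension index of this axis.
    input_positions: dict


# Hypothetical information base: operator type -> its segmentable axes.
SEGMENTATION_INFO_BASE = {
    # y = relu(x), x has shape (n, c): both axes are element axes.
    "relu": [
        SegmentableAxis("n", AxisType.ELEMENT, {0: 0}),
        SegmentableAxis("c", AxisType.ELEMENT, {0: 1}),
    ],
    # y = reduce_sum(x) over c, x has shape (n, c): c vanishes from the output.
    "reduce_sum": [
        SegmentableAxis("n", AxisType.ELEMENT, {0: 0}),
        SegmentableAxis("c", AxisType.REDUCTION, {0: 1}),
    ],
    # y = conv1d(x, w), x has shape (n, length): the kernel slides along "length".
    "conv1d": [
        SegmentableAxis("n", AxisType.ELEMENT, {0: 0}),
        SegmentableAxis("length", AxisType.SLIDING_WINDOW, {0: 1}),
    ],
}


def lookup_segmentation_info(op_type):
    """Return the segmentable axes of an operator without inspecting its math."""
    return SEGMENTATION_INFO_BASE[op_type]
```

In this sketch the caller only reads axis types and positions, which mirrors the idea that the graph optimizer does not need to know what the operator computes.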
In a possible implementation, a target segmentation axis is determined, the target segmentation axis being one of the N segmentable axes; the segmentation mode corresponding to the axis type of the target segmentation axis in the first operator is determined according to the segmentation information of the first operator; and the input tensor of the first operator is segmented according to that segmentation mode to obtain the K groups of input tensors.
In the embodiments of the present application, the graph optimizer segments a single operator according to the types of the different axes in the operator's input tensor and the segmentation mode corresponding to each axis type. The graph optimizer can thus obtain different single-operator segmentation strategies automatically, without relying on the principle of any specific operator, which fully decouples the graph optimizer from the operator optimization module.
In a possible implementation, segmenting the input tensor of the first operator according to the segmentation mode corresponding to the axis type of the target segmentation axis in the first operator to obtain the K groups of input tensors includes: determining, according to the segmentation mode, Q first input tensors of the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator and the number K of target computing resources to obtain Q groups of second input tensors; and obtaining the K groups of input tensors according to the Q groups of second input tensors and the unsegmented input tensors of the first operator.
Each group of second input tensors among the Q groups includes K second input tensors, and the q-th group of second input tensors is the result of segmenting the q-th first input tensor among the Q first input tensors into K parts, where q = 1, …, Q.
The k-th group of input tensors among the K groups includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsegmented input tensors of the first operator.
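The assembly of the K groups described above can be sketched as follows. This is a minimal illustration under stated assumptions: tensors are nested Python lists, the parts are made nearly equal in size, and the helper names (`split_bounds`, `slice_along`, `axis_length`, `build_input_groups`) are hypothetical.

```python
def split_bounds(length, k):
    """Split range(length) into k contiguous, nearly equal (start, stop) pairs."""
    base, extra = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def slice_along(tensor, dim, start, stop):
    """Slice a nested-list tensor along dimension `dim`."""
    if dim == 0:
        return tensor[start:stop]
    return [slice_along(sub, dim - 1, start, stop) for sub in tensor]


def axis_length(tensor, dim):
    """Length of dimension `dim` of a nested-list tensor."""
    for _ in range(dim):
        tensor = tensor[0]
    return len(tensor)


def build_input_groups(inputs, axis_positions, k):
    """inputs: all input tensors of the operator.
    axis_positions: input index -> dimension of the target segmentation axis
    (only the Q inputs that contain the axis appear here)."""
    groups = []
    for part in range(k):
        group = []
        for index, tensor in enumerate(inputs):
            if index in axis_positions:
                dim = axis_positions[index]
                start, stop = split_bounds(axis_length(tensor, dim), k)[part]
                group.append(slice_along(tensor, dim, start, stop))
            else:
                group.append(tensor)  # unsegmented input, shared by every group
        groups.append(group)
    return groups


# Example: x has shape (4, 2) and is segmented along axis 0; the bias is not segmented.
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
bias = [10, 20]
groups = build_input_groups([x, bias], {0: 0}, k=2)
```

Here Q = 1 and K = 2, so each group holds one part of `x` plus the whole `bias`.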
In a possible implementation, the operators used to perform the computing task further include a second operator, the second operator includes P segmentable axes, and the P segmentable axes are a subset of the N segmentable axes. In this case, segmenting the input tensor of the first operator according to the segmentation information of the first operator to obtain the K groups of input tensors includes: obtaining segmentation information of the second operator from the operator segmentation information base, where the segmentation information of the second operator includes the axis type, in the second operator, of the p-th segmentable axis among the P segmentable axes and second position information, the second position information indicates the position of the p-th segmentable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P; determining P pieces of segmentation reference information according to the segmentation information of the first operator and the segmentation information of the second operator, where the p-th piece of segmentation reference information includes the axis type of the p-th segmentable axis in the first operator, the axis type of the p-th segmentable axis in the second operator, and the position of the p-th segmentable axis in the input tensor of the first operator; determining P groups of candidate segmentation modes according to the P pieces of segmentation reference information, where the p-th group of candidate segmentation modes includes at least one segmentation mode; determining a target segmentation mode according to the time that each segmentation mode in the P groups of candidate segmentation modes needs to complete the computing task; and segmenting the input tensor of the first operator according to the target segmentation mode to obtain the K groups of input tensors.
The segmentation modes included in the p-th group of candidate segmentation modes are determined according to the p-th piece of segmentation reference information among the P pieces of segmentation reference information and the number M of computing resources.
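Selecting the target segmentation mode by completion time can be sketched as a simple minimum search. The cost model below is a toy assumption made only for illustration (the application does not specify how the time is estimated), and `Mode`, `toy_estimate_time`, and `choose_target_mode` are hypothetical names.

```python
from collections import namedtuple

# A candidate segmentation mode: which axis to segment, into how many parts,
# and how many extra (overlapping) elements each part has to process.
Mode = namedtuple("Mode", ["axis", "parts", "extra_per_part"])

TOTAL_WORK = 512  # toy amount of work for the whole computing task


def toy_estimate_time(mode):
    """Toy cost model (an assumption): work per resource plus overlap overhead."""
    return TOTAL_WORK / mode.parts + mode.extra_per_part


def choose_target_mode(candidate_groups, estimate_time):
    """Return the candidate with the smallest estimated completion time."""
    all_modes = [mode for group in candidate_groups for mode in group]
    return min(all_modes, key=estimate_time)


candidate_groups = [
    [Mode("n", 2, 0), Mode("n", 4, 0)],             # candidates for one axis
    [Mode("length", 4, 16), Mode("length", 8, 16)], # candidates for another axis
]
target = choose_target_mode(candidate_groups, toy_estimate_time)
```

In practice the estimate could come from measurement or from a hardware cost model; only the "pick the fastest candidate" step is taken from the text above.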
In the embodiments of the present application, the graph optimizer automatically segments the input and output tensors of an operator according to the different axis types. The graph optimizer does not need to segment the input and output tensors based on the principle of any specific operator; it only needs to apply the operator segmentation mode corresponding to each axis type. For the operator, segmenting its input and output tensors does not change its calculation formula and changes only some of its parameters. Graph optimization can therefore be completely decoupled from the principles of specific operators, and segmenting the first input tensor of an operator by axis type generalizes better. In addition, a suitable operator segmentation mode can be selected flexibly according to the axis types of the segmentable axes and the positions of the segmentable axes in the operator's input and output tensors, both of which are included in the operator's segmentation information.
In a possible implementation, segmenting the input tensor of the first operator according to the target segmentation mode to obtain the K groups of input tensors includes: determining, according to the target segmentation mode, the target segmentation axis, the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, Q first input tensors of the first operator that include the target segmentation axis, and the position of the target segmentation axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of target computing resources to obtain Q groups of second input tensors, where each group of second input tensors includes K second input tensors, the q-th group of second input tensors is the result of segmenting the q-th first input tensor among the Q first input tensors into K parts, and q = 1, …, Q; and obtaining the K groups of input tensors according to the Q groups of second input tensors and the unsegmented input tensors of the first operator.
The k-th group of input tensors among the K groups includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsegmented input tensors of the first operator.
In a possible implementation, segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator, the axis type of the target segmentation axis in the second operator, and the number K of target computing resources to obtain the Q groups of second input tensors includes: if the axis type of the target segmentation axis in the first operator is element axis or sliding window axis and the axis type of the target segmentation axis in the second operator is element axis or sliding window axis, determining, according to the first position information and the second position information of the target segmentation axis, L first output tensors of the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using a first input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain a third input length, where the first input length is the length of the target segmentation axis in each first input tensor and the target segmentation axis has the same length in every first input tensor; using the third input length as the input of the forward shape derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain a first output length; segmenting the L first output tensors along the target segmentation axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each group of second output tensors includes K second output tensors and the l-th group of second output tensors is the result of segmenting the l-th first output tensor among the L first output tensors into K parts; using the K second output lengths corresponding to the target segmentation axis in each group of second output tensors as the inputs of the reverse derivation function corresponding to the axis type of the target segmentation axis in the second operator to obtain K third input lengths corresponding to the target segmentation axis in each of Q groups of fifth input tensors, where the target segmentation axis has the same length in the k-th second output tensor of every group of second output tensors and the same length in the k-th second input tensor of every group of fifth input tensors; using the K third input lengths corresponding to the target segmentation axis in each group of fifth input tensors as the inputs of the reverse derivation function corresponding to the axis type of the target segmentation axis in the first operator to obtain K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, where the target segmentation axis has the same length in the k-th second input tensor of every group of second input tensors; and segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors to obtain the Q groups of second input tensors.
In the embodiments of the present application, consecutive operators are computed on the segmented input tensors on the same target computing resource, which enables parallel computing across multiple target computing resources.
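The forward and reverse length derivation across two consecutive operators can be sketched as below. The concrete formulas are assumptions chosen for illustration: an element axis keeps its length, and a sliding window axis follows the usual no-padding rule `out = (in - kernel) // stride + 1` with reverse rule `in = (out - 1) * stride + kernel`. The function names and the `(axis_type, kernel, stride)` operator description are hypothetical.

```python
def forward_length(op, in_len):
    """Forward shape derivation along one axis; op = (axis_type, kernel, stride)."""
    axis_type, kernel, stride = op
    if axis_type == "element":
        return in_len
    if axis_type == "sliding_window":
        return (in_len - kernel) // stride + 1
    raise ValueError("unsupported axis type: " + axis_type)


def reverse_length(op, out_len):
    """Reverse derivation: input length needed to produce out_len outputs."""
    axis_type, kernel, stride = op
    if axis_type == "element":
        return out_len
    if axis_type == "sliding_window":
        return (out_len - 1) * stride + kernel
    raise ValueError("unsupported axis type: " + axis_type)


def even_lengths(total, k):
    """Split a length into k nearly equal part lengths."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def derive_input_lengths(first_input_len, op1, op2, k):
    mid_len = forward_length(op1, first_input_len)   # length after the first operator
    out_len = forward_length(op2, mid_len)           # length after the second operator
    out_parts = even_lengths(out_len, k)             # segment the final output into K parts
    mid_parts = [reverse_length(op2, n) for n in out_parts]
    return [reverse_length(op1, n) for n in mid_parts]


conv3 = ("sliding_window", 3, 1)
relu = ("element", 1, 1)
two_convs = derive_input_lengths(12, conv3, conv3, k=2)
conv_then_relu = derive_input_lengths(10, conv3, relu, k=2)
```

With two stacked 3-wide windows, an input of length 12 yields 8 outputs, so each of the 2 parts needs 8 input elements; the parts overlap because 8 + 8 > 12.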
In a possible implementation, when the axis type of the target segmentation axis in the first operator is element axis or sliding window axis, the first position information of the target segmentation axis also indicates the position of the target segmentation axis in the output tensor of the first operator, and segmenting each of the Q first input tensors according to the axis type of the target segmentation axis in the first operator and the number K of target computing resources to obtain the Q groups of second input tensors includes: determining, according to the first position information of the target segmentation axis, L first output tensors of the first operator that include the target segmentation axis and the position of the target segmentation axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; using a first input length as the input of the forward shape derivation function of the target segmentation axis to obtain a first output length, where the first input length is the length of the target segmentation axis in each first input tensor and the target segmentation axis has the same length in every first input tensor; segmenting the L first output tensors along the target segmentation axis according to the first output length and the number K of target computing resources to obtain L groups of second output tensors, where each group of second output tensors includes K second output tensors; using the K second output lengths corresponding to the target segmentation axis in each group of second output tensors as the inputs of the reverse derivation function of the target segmentation axis to obtain K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors; and segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each group of second input tensors to obtain the Q groups of second input tensors.
The l-th group of second output tensors among the L groups of second output tensors is the result of segmenting the l-th first output tensor among the L first output tensors into K parts.
The target segmentation axis has the same length in the k-th second output tensor of every group among the L groups of second output tensors, and the same length in the k-th second input tensor of every group among the Q groups of second input tensors.
In a possible implementation, when the axis type of the target segmentation axis in the first operator is element axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a first segmentation function to segment each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
In the q-th group of second input tensors among the Q groups, the elements corresponding to the target segmentation axis in each second input tensor are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor among the Q first input tensors; these element sets do not intersect across the second input tensors of the q-th group; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
In a possible implementation, when the axis type of the target segmentation axis in the first operator is sliding window axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a first slice function to segment, with overlap, each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
In the q-th group of second input tensors among the Q groups, the elements corresponding to the target segmentation axis in each second input tensor are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor among the Q first input tensors; these element sets intersect across the second input tensors of the q-th group; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
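The overlapping segmentation of a sliding window axis can be illustrated with a 1-D window operator. This is a minimal sketch, assuming stride 1 and no padding; `conv1d`, `split_bounds`, and `overlapped_segments` are hypothetical helpers, not functions named in the application.

```python
def split_bounds(length, k):
    """Split range(length) into k contiguous, nearly equal (start, stop) pairs."""
    base, extra = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def conv1d(x, w):
    """1-D sliding window operator: stride 1, no padding."""
    out_len = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(out_len)]


def overlapped_segments(x, kernel, k):
    """Segment the output range evenly, then take the input slice each part needs."""
    out_len = len(x) - kernel + 1
    return [x[start:stop + kernel - 1] for start, stop in split_bounds(out_len, k)]


x = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
w = [1, 1, 1]
segments = overlapped_segments(x, kernel=len(w), k=2)
partial_outputs = [conv1d(segment, w) for segment in segments]
merged = [value for part in partial_outputs for value in part]
```

Adjacent segments share `kernel - 1` input elements, and concatenating the per-segment outputs reproduces the output of the unsegmented operator.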
In a possible implementation, when the axis type of the target segmentation axis in the first operator is sliding window axis, segmenting each of the Q first input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors to obtain the Q groups of second input tensors includes: invoking a second segmentation function to segment each of the Q first input tensors along the target segmentation axis to obtain Q groups of third input tensors, where each group of third input tensors includes K third input tensors; invoking a second slice function to slice the K third input tensors in each of the Q groups of third input tensors along the target segmentation axis according to the K second input lengths corresponding to the target segmentation axis in each of the Q groups of second input tensors, to obtain Q groups of fourth input tensors; and invoking a concatenation function to concatenate, along the target segmentation axis, the k-th fourth input tensor of the q-th group of fourth input tensors with the k-th third input tensor of the q-th group of third input tensors, to obtain the Q groups of second input tensors.
In the q-th group of third input tensors among the Q groups, the elements corresponding to the target segmentation axis in each third input tensor are a subset of the elements corresponding to the target segmentation axis in the q-th first input tensor among the Q first input tensors; these element sets do not intersect across the third input tensors of the q-th group; and their union is the set of elements corresponding to the target segmentation axis in the q-th first input tensor.
The elements along the target split axis of the k-th second input tensor in the q-th group of second input tensors are contiguous.
In this embodiment of the present application, the non-overlapping split of a sliding window axis is suited to scenarios that require frequent data synchronization between computing resources, for example multi-die parallelism, where the concatenation function serves as the data synchronization node between dies. Overlapping data is then neither computed repeatedly nor allowed to keep growing, which effectively relieves the computing and storage pressure on the computing resources.
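The split, slice, and concatenate sequence described above can be sketched on a one-dimensional list. This is a minimal illustration, not the implementation of the application: the helper names (`split_disjoint`, `halo_slices`, `concat_with_halo`, `window_sum`) are hypothetical, the window uses stride 1 without padding, and the sketch assumes that the slice attached to the k-th chunk is taken from the elements that immediately follow it.

```python
def split_disjoint(xs, k):
    """Split xs into k contiguous, non-overlapping chunks (the 'third' tensors)."""
    base, extra = divmod(len(xs), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(xs[start:start + size])
        start += size
    return chunks


def halo_slices(chunks, kernel):
    """Slice, for each chunk, the kernel-1 elements that follow it (the 'fourth' tensors)."""
    halos = []
    for i in range(len(chunks)):
        following = [x for c in chunks[i + 1:] for x in c]
        halos.append(following[:kernel - 1])
    return halos


def concat_with_halo(chunks, halos):
    """Concatenate each chunk with its slice along the split axis (the 'second' tensors)."""
    return [c + h for c, h in zip(chunks, halos)]


def window_sum(xs, kernel):
    """Toy sliding-window operator: sum over each window, stride 1, no padding."""
    return [sum(xs[i:i + kernel]) for i in range(len(xs) - kernel + 1)]


data = list(range(10))
third = split_disjoint(data, 3)           # disjoint chunks, union is the whole input
fourth = halo_slices(third, 3)            # boundary data exchanged between resources
second = concat_with_halo(third, fourth)  # contiguous pieces, one per resource
merged = [y for piece in second for y in window_sum(piece, 3)]
```

Under these assumptions, concatenating the per-piece results in `merged` reproduces `window_sum(data, 3)`, and only the `kernel - 1` boundary elements are exchanged at each synchronization point.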
In a possible implementation, when the axis type of the target split axis in the first operator is a reduction axis, splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources, to obtain the Q groups of second input tensors, includes: invoking a third split function, according to the number K of target computing resources, to split each of the Q first input tensors, obtaining the Q groups of second input tensors.
Here, the elements along the target split axis of each second input tensor in the q-th group of second input tensors are a subset of the elements along the target split axis of the q-th of the Q first input tensors; the elements along the target split axis of the second input tensors in the q-th group do not intersect; and the union of the elements along the target split axis of all second input tensors in the q-th group is the set of elements along the target split axis of the q-th first input tensor.
In this embodiment of the present application, the reduction axis type already determines the specific split method. During graph optimization, the graph optimizer can therefore reasonably split the input tensors of a specific operator that contains a reduction axis, without relying on how that operator works. Conventional operator splitting, by contrast, starts from the output tensor of the specific operator. Because a reduction axis either does not appear in the output tensor or has length 1 there, conventional splitting cannot split an input-tensor axis that has the characteristics of a reduction axis.
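A small sketch may make the point about the output tensor concrete. It assumes a toy row-sum operator over a two-dimensional nested list, in which the column axis is the reduction axis, and it assumes that the partial results are combined by element-wise addition; the function names are hypothetical and the combination step is not specified in this passage.

```python
def split_columns(matrix, k):
    """Split a 2-D list into k shards along the column (reduction) axis."""
    n_cols = len(matrix[0])
    base, extra = divmod(n_cols, k)
    shards, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        shards.append([row[start:start + size] for row in matrix])
        start += size
    return shards


def row_sums(matrix):
    """Toy reduction operator: the column axis disappears from the output."""
    return [sum(row) for row in matrix]


def row_sums_split(matrix, k):
    """Run the operator on each shard, then add the partial outputs element-wise."""
    partials = [row_sums(shard) for shard in split_columns(matrix, k)]
    return [sum(values) for values in zip(*partials)]


m = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]
result = row_sums_split(m, 2)
```

The output has one entry per row and no column axis at all, so a scheme that derives the split from the output tensor has nothing to split along that axis, whereas splitting the input along it is straightforward.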
In a possible implementation, reduction axes include first-type reduction axes and second-type reduction axes. A first-type reduction axis is a reduction axis along which the operator performs a reduction operation on the elements of its input tensor; a second-type reduction axis is a reduction axis along which the operator does not perform a reduction operation on the elements of its input tensor.
In a possible implementation, the first-type reduction axis includes any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, and a reduce-mean axis. A reduce-sum axis is a reduction axis along which the operator sums the elements of its input tensor; a reduce-max axis is a reduction axis along which the operator takes the maximum of the elements of its input tensor; a reduce-min axis is a reduction axis along which the operator takes the minimum of the elements of its input tensor; and a reduce-mean axis is a reduction axis along which the operator averages the elements of its input tensor.
In a possible implementation, the second-type reduction axis includes a reduce-gather axis, which is the axis along which the operator indexes data among the elements of its input tensor according to the addresses indicated by the elements of its index input tensor.
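The reduction-axis kinds listed above differ mainly in how per-shard results would be recombined. The following sketch is an assumption-laden illustration on one-dimensional lists: the recombination rules (sum of partial sums, max of partial maxima, and so on) and the offset-based lookup for the gather case are not stated in this passage, and `reduce_split` and `gather_split` are hypothetical names.

```python
def split_even(xs, k):
    """Split xs into k contiguous, non-overlapping chunks."""
    base, extra = divmod(len(xs), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(xs[start:start + size])
        start += size
    return chunks


def reduce_split(xs, k, kind):
    """Reduce each chunk separately, then combine the partial results."""
    chunks = [c for c in split_even(xs, k) if c]  # ignore empty chunks
    if kind == "sum":
        return sum(sum(c) for c in chunks)
    if kind == "max":
        return max(max(c) for c in chunks)
    if kind == "min":
        return min(min(c) for c in chunks)
    if kind == "mean":
        # Combine partial sums; dividing per chunk would weight chunks unevenly.
        return sum(sum(c) for c in chunks) / len(xs)
    raise ValueError(f"unknown reduction kind: {kind}")


def gather_split(data, indices, k):
    """Gather axis: each chunk answers only the indices that fall inside its range."""
    out = [None] * len(indices)
    offset = 0
    for chunk in split_even(data, k):
        for pos, idx in enumerate(indices):
            if offset <= idx < offset + len(chunk):
                out[pos] = chunk[idx - offset]
        offset += len(chunk)
    return out


values = [4, 1, 7, 3, 9]
gathered = gather_split([10, 20, 30, 40, 50], [4, 0, 2], 2)
```

In the gather case no arithmetic reduction takes place, which matches the description of the second-type reduction axis: the axis is consumed by indexing rather than by a sum, maximum, minimum, or mean.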
In a possible implementation, the computing resource includes one of the following: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
In a second aspect, an apparatus for processing a computing task is provided. The apparatus is applied to a graph optimizer and includes a processor and a transmission interface. The processor is configured to determine a first operator for executing the computing task, where the first operator includes N splittable axes and N is a positive integer greater than or equal to 1. The processor is configured to obtain split information of the first operator from an operator split information library, where the split information of the first operator includes the axis type, in the first operator, of the n-th of the N splittable axes, and first position information that indicates the position of the n-th splittable axis in the input tensors of the first operator, with n = 1, …, N. The processor is configured to split the input tensors of the first operator according to the split information of the first operator, obtaining K groups of input tensors, where K is a positive integer greater than or equal to 2. The transmission interface is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be understood that the statement that the first operator includes N splittable axes means that the input tensors of the first operator include N splittable axes.
It should also be understood that the computing task may be a computing task in the field of artificial intelligence, for example an image processing, video processing, speech processing, or natural language processing task. It may also be a computing task in the field of big data processing or in the field of high-performance computing (HPC); this application imposes no limitation in this respect. Correspondingly, the input tensor of the first operator may be the input tensor of a computing task in any of these fields. For example, when the computing task is an image processing task, the input tensor of the first operator represents image-related data.
At present, the input tensors of an operator are split by algorithm engineers, who determine the split method at the application layer in a scripting language according to the split axes included in a given operator type, so automatic splitting of operator input tensors cannot be achieved. In this embodiment of the present application, the graph optimizer obtains the split information of operators from the operator split information library. Because the split information of each operator can be obtained directly from the library, the graph optimizer can split operator input tensors automatically without any knowledge of each operator's mathematical semantics or underlying implementation. Graph optimization and operator optimization are thereby fully decoupled, and the operators of the computing task can be computed in parallel on multiple computing resources.
In a possible implementation, the axis type of a splittable axis is one of the following: element axis, reduction axis, and sliding window axis. An axis along which the elements of the operator's input tensor and output tensor have a point-to-point mapping is an element axis. If a first axis is present in the operator's input tensor but absent from the operator's output tensor, the first axis is a reduction axis. An axis along which the operator performs a sliding-window scan over the elements of its input tensor is a sliding window axis.
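One way to picture the operator split information library is as a lookup table keyed by operator name, where each entry records an axis type and the position of the axis in each input tensor. The schema below is purely hypothetical (the application does not define field names or a storage format), and the three operators and their axis names are illustrative placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"                # point-to-point input/output mapping
    REDUCTION = "reduction"            # present in the input, absent from the output
    SLIDING_WINDOW = "sliding_window"  # scanned by a sliding window


@dataclass
class SplittableAxis:
    name: str
    axis_type: AxisType
    # Position information: input tensor index -> axis index inside that tensor.
    input_positions: dict


# Hypothetical library contents, one splittable axis per operator for brevity.
SPLIT_INFO_LIBRARY = {
    "relu": [SplittableAxis("n", AxisType.ELEMENT, {0: 0})],
    "reduce_sum": [SplittableAxis("r", AxisType.REDUCTION, {0: 1})],
    "avg_pool1d": [SplittableAxis("w", AxisType.SLIDING_WINDOW, {0: 2})],
}


def lookup(op_name, axis_name):
    """Return the split information of one axis without inspecting the operator itself."""
    for axis in SPLIT_INFO_LIBRARY[op_name]:
        if axis.name == axis_name:
            return axis
    raise KeyError(axis_name)


def inputs_containing(op_name, axis_name):
    """Indices of the input tensors that contain the given axis."""
    return sorted(lookup(op_name, axis_name).input_positions)
```

With a table of this kind, the optimizer only branches on `axis_type` and never needs operator-specific logic, which is the decoupling the text describes.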
In a possible implementation, the processor is specifically configured to: determine a target split axis, the target split axis being one of the N splittable axes; determine, according to the split information of the first operator, the split method corresponding to the axis type of the target split axis in the first operator; and split the input tensors of the first operator according to that split method, obtaining the K groups of input tensors.
In a possible implementation, the processor is specifically configured to: determine, according to the split method corresponding to the axis type of the target split axis in the first operator, the Q first input tensors of the first operator that include the target split axis and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target split axis in the first operator and the number K of target computing resources, obtaining Q groups of second input tensors; and obtain the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
Each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of splitting the q-th of the Q first input tensors into K parts, where q = 1, …, Q.
The k-th of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
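The regrouping from Q groups of second input tensors into K groups of input tensors is essentially a transpose plus a copy of the unsplit inputs. The sketch below assumes one-dimensional lists as tensors and an even split; `build_groups` and its parameters are hypothetical names chosen for illustration.

```python
def split_even(xs, k):
    """Split xs into k contiguous, non-overlapping chunks."""
    base, extra = divmod(len(xs), k)
    chunks, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        chunks.append(xs[start:start + size])
        start += size
    return chunks


def build_groups(inputs, split_ids, k):
    """Build K input groups: the k-th piece of every split input plus all unsplit inputs.

    inputs    -- all input tensors of the operator, in operator order
    split_ids -- indices of the Q inputs that contain the target split axis
    """
    pieces = {i: split_even(inputs[i], k) for i in split_ids}
    groups = []
    for j in range(k):
        group = [pieces[i][j] if i in pieces else inputs[i]
                 for i in range(len(inputs))]
        groups.append(group)
    return groups


# Inputs 0 and 1 contain the target split axis (Q = 2); input 2 does not.
operator_inputs = [[1, 2, 3, 4], [5, 6, 7, 8], [0.5]]
groups = build_groups(operator_inputs, {0, 1}, 2)
```

Each entry of `groups` is what one target computing resource would receive: its own slice of the split inputs and a full copy of the inputs that were left unsplit.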
In this embodiment of the present application, the graph optimizer performs single-operator splitting based on the types of the axes in the operator's input tensors and the split method corresponding to each axis type. It can thus obtain different single-operator split strategies automatically, without relying on how a specific operator works, which fully decouples the graph optimizer from the operator optimization module.
In a possible implementation, when the operators used to execute the computing task further include a second operator, the second operator includes P splittable axes that are a subset of the N splittable axes, and the processor is specifically configured to: obtain split information of the second operator from the operator split information library, where the split information of the second operator includes the axis type, in the second operator, of the p-th of the P splittable axes, and second position information that indicates the position of the p-th splittable axis in the input tensor of the second operator, the input tensor of the second operator being the output tensor of the first operator, where P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P; determine P pieces of split reference information according to the split information of the first operator and the split information of the second operator, where the p-th piece of split reference information includes the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensors of the first operator; determine P groups of candidate split methods according to the P pieces of split reference information, where the p-th group of candidate split methods includes at least one split method; determine a target split method according to the time that each split method in the P groups of candidate split methods needs to complete the computing task; and split the input tensors of the first operator according to the target split method, obtaining the K groups of input tensors.
In a possible implementation, the split methods included in the p-th group of candidate split methods are determined according to the p-th of the P pieces of split reference information and the number M of computing resources.
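The selection of a target split method from the P candidate groups can be pictured as a minimum search over an estimated completion time. The sketch below is a toy under explicit assumptions: a candidate is modelled as an (axis name, K) pair with K ranging from 2 to M, and `toy_time` is an invented cost model (equal share of a fixed workload plus a per-resource overhead that depends on the axis). None of these details come from the application.

```python
def candidate_groups(axes, m):
    """One candidate group per splittable axis; each candidate is (axis_name, k)."""
    return [[(axis, k) for k in range(2, m + 1)] for axis in axes]


def choose_target(groups, estimate_time):
    """Pick the candidate with the smallest estimated completion time."""
    flat = [candidate for group in groups for candidate in group]
    return min(flat, key=estimate_time)


TOTAL_WORK = 64                  # hypothetical amount of work for the whole task
OVERHEAD = {"h": 3, "c": 0}      # hypothetical per-resource overhead for each axis


def toy_time(candidate):
    """Invented cost model: equal share of the work plus overhead that grows with k."""
    axis, k = candidate
    return TOTAL_WORK / k + OVERHEAD[axis] * k


groups = candidate_groups(["h", "c"], 4)
target = choose_target(groups, toy_time)
```

In practice the time estimate would presumably come from profiling or an analytical model of the target computing resources; the sketch only shows where such an estimate plugs in.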
In this embodiment of the present application, the graph optimizer automatically determines how to split an operator's input and output tensors according to the different axis types. The graph optimizer does not need to split the input and output tensors based on how a specific operator works; it only needs to apply the operator split method that corresponds to each axis type. For the operator, splitting its input and output tensors does not change its computation formula, only some of its parameters. Graph optimization is thus fully decoupled from the principles of specific operators, and splitting an operator's first input tensors by axis type generalizes better. In addition, a suitable operator split method can be selected flexibly according to the axis types of the splittable axes and their position information in the operator's input and output tensors, both of which are included in the operator's split information.
In a possible implementation, the processor is specifically configured to: determine, according to the target split method, the target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1; split each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of target computing resources, obtaining Q groups of second input tensors, where each of the Q groups of second input tensors includes K second input tensors, and the q-th group of second input tensors is the result of splitting the q-th of the Q first input tensors into K parts, with q = 1, …, Q; and obtain the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
The k-th of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
In a possible implementation, if the axis type of the target split axis in the first operator is an element axis or a sliding window axis, and the axis type of the target split axis in the second operator is an element axis or a sliding window axis, the processor is specifically configured to: determine, according to the first position information and the second position information of the target split axis, the L first output tensors of the first operator that include the target split axis and the position of the target split axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape derivation function corresponding to the axis type of the target split axis in the first operator, obtaining a third input length, where the first input length is the length of the target split axis in each first input tensor and this length is the same in every first input tensor; use the third input length as the input of the forward shape derivation function corresponding to the axis type of the target split axis in the second operator, obtaining a first output length; split the L first output tensors along the target split axis according to the first output length and the number K of target computing resources, obtaining L groups of second output tensors, where each of the L groups of second output tensors includes K second output tensors and the l-th group of second output tensors is the result of splitting the l-th of the L first output tensors into K parts; use the K second output lengths corresponding to the target split axis in each of the L groups of second output tensors as inputs of the backward derivation function corresponding to the axis type of the target split axis in the second operator, obtaining the K third input lengths corresponding to the target split axis in each of Q groups of fifth input tensors, where the length of the target split axis in the k-th second output tensor is the same across the L groups of second output tensors, and the length of the target split axis in the k-th tensor is the same across the Q groups of fifth input tensors; use the K third input lengths corresponding to the target split axis in each of the Q groups of fifth input tensors as inputs of the backward derivation function corresponding to the axis type of the target split axis in the first operator, obtaining the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, where the length of the target split axis in the k-th second input tensor is the same across the Q groups of second input tensors; and split each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, obtaining the Q groups of second input tensors.
In this embodiment of the present application, the consecutive operator computations on a given split input tensor are performed on the same target computing resource, which enables parallel computation across multiple target computing resources.
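The chain of forward and backward derivations across two consecutive operators can be illustrated with plain arithmetic on lengths. The sketch below assumes the usual unpadded sliding-window relation, out = (in - kernel) // stride + 1, and its inverse, in = (out - 1) * stride + kernel, with the identity for an element axis. The dictionary layout and function names are hypothetical, and every piece is assumed to receive at least one output element.

```python
def forward_length(axis, length):
    """Forward shape derivation: input length -> output length along one axis."""
    if axis["type"] == "element":
        return length
    return (length - axis["kernel"]) // axis["stride"] + 1  # sliding window, no padding


def backward_length(axis, length):
    """Backward derivation: output length -> input length needed to produce it."""
    if axis["type"] == "element":
        return length
    return (length - 1) * axis["stride"] + axis["kernel"]


def split_lengths(total, k):
    """Divide an output length into k parts that differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def second_input_lengths(first_input_length, op1, op2, k):
    """Input lengths each resource needs so that op1 followed by op2 runs locally."""
    third = forward_length(op1, first_input_length)   # length after the first operator
    first_output = forward_length(op2, third)         # length after the second operator
    return [backward_length(op1, backward_length(op2, n))
            for n in split_lengths(first_output, k)]


op1 = {"type": "sliding_window", "kernel": 3, "stride": 1}
op2 = {"type": "sliding_window", "kernel": 2, "stride": 2}
lengths = second_input_lengths(20, op1, op2, 2)
```

For an input of length 20 the two pieces need 12 and 10 input elements, 22 in total, so two elements are shared between the pieces. That overlap is what lets each resource run both operators back to back without exchanging data in between.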
In a possible implementation, when the axis type of the target split axis in the first operator is an element axis or a sliding window axis, the first position information of the target split axis further indicates the position of the target split axis in the output tensors of the first operator, and the processor is specifically configured to: determine, according to the first position information of the target split axis, the L first output tensors of the first operator that include the target split axis and the position of the target split axis in each of the L first output tensors, where L is a positive integer greater than or equal to 1; use a first input length as the input of the forward shape derivation function of the target split axis, obtaining a first output length, where the first input length is the length of the target split axis in each first input tensor and this length is the same in every first input tensor; split the L first output tensors along the target split axis according to the first output length and the number K of target computing resources, obtaining L groups of second output tensors, where each of the L groups of second output tensors includes K second output tensors and the l-th group of second output tensors is the result of splitting the l-th of the L first output tensors into K parts; use the K second output lengths corresponding to the target split axis in each of the L groups of second output tensors as inputs of the backward derivation function of the target split axis, obtaining the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, where the length of the target split axis in the k-th second output tensor is the same across the L groups of second output tensors, and the length of the target split axis in the k-th second input tensor is the same across the Q groups of second input tensors; and split each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, obtaining the Q groups of second input tensors.
In a possible implementation, when the axis type of the target split axis in the first operator is an element axis, the processor is specifically configured to: invoke a first split function, according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to split each of the Q first input tensors along the target split axis, obtaining the Q groups of second input tensors.
Here, the elements along the target split axis of each second input tensor in the q-th group of second input tensors are a subset of the elements along the target split axis of the q-th of the Q first input tensors; the elements along the target split axis of the second input tensors in the q-th group do not intersect; and the union of the elements along the target split axis of all second input tensors in the q-th group is the set of elements along the target split axis of the q-th first input tensor.
In a possible implementation, when the axis type of the target split axis in the first operator is a sliding window axis, the processor is specifically configured to: invoke a first slice function, according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to perform an overlapping split of each of the Q first input tensors along the target split axis, obtaining the Q groups of second input tensors.
Here, the elements along the target split axis of each second input tensor in the q-th group of second input tensors are a subset of the elements along the target split axis of the q-th of the Q first input tensors; the elements along the target split axis of the second input tensors in the q-th group intersect; and the union of the elements along the target split axis of all second input tensors in the q-th group is the set of elements along the target split axis of the q-th first input tensor.
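The overlapping split of a sliding window axis can be sketched by slicing each piece from the input so that it contains every element its share of the output windows needs. The sketch assumes a one-dimensional list, no padding, and hypothetical helper names; with a stride larger than 1 some trailing input elements may belong to no window, in which case the union of the pieces would not cover the whole input as it does in this example.

```python
def overlapped_slices(xs, kernel, stride, k):
    """Slice xs into k overlapping pieces, one per resource, along a sliding window axis."""
    out_len = (len(xs) - kernel) // stride + 1
    base, extra = divmod(out_len, k)
    pieces, out_start = [], 0
    for i in range(k):
        n_out = base + (1 if i < extra else 0)     # output elements for this piece
        in_start = out_start * stride              # first input element it needs
        in_len = (n_out - 1) * stride + kernel if n_out > 0 else 0
        pieces.append(xs[in_start:in_start + in_len])
        out_start += n_out
    return pieces


def window_max(xs, kernel, stride):
    """Toy sliding-window operator: maximum over each window, no padding."""
    return [max(xs[i:i + kernel]) for i in range(0, len(xs) - kernel + 1, stride)]


data = list(range(10))
pieces = overlapped_slices(data, kernel=3, stride=1, k=2)
merged = [y for piece in pieces for y in window_max(piece, 3, 1)]
```

Here the two pieces share the elements 4 and 5, and running the operator on each piece independently gives the same result as running it on the whole input, with no communication between resources.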
In a possible implementation, when the axis type of the target split axis in the first operator is a sliding window axis, the processor is specifically configured to: invoke a second split function to split each of the Q first input tensors along the target split axis, obtaining Q groups of third input tensors, each group of which includes K third input tensors; invoke a second slice function, according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to slice the K third input tensors in each of the Q groups of third input tensors along the target split axis, obtaining Q groups of fourth input tensors; and invoke a concatenation function to concatenate, along the target split axis, the k-th fourth input tensor in the q-th group of fourth input tensors with the k-th third input tensor in the q-th group of third input tensors, obtaining the Q groups of second input tensors.
Here, the elements along the target split axis of each third input tensor in the q-th group of third input tensors are a subset of the elements along the target split axis of the q-th of the Q first input tensors; the elements along the target split axis of the third input tensors in the q-th group do not intersect; and the union of the elements along the target split axis of all third input tensors in the q-th group is the set of elements along the target split axis of the q-th first input tensor.
其中,Q组第二输入张量中第q组第二输入张量中第k个第二输入张量的目标切分轴对应的元素为连续的。Wherein, the elements corresponding to the target split axis of the kth second input tensor in the qth group of second input tensors in the Q group of second input tensors are continuous.
In this embodiment of the application, splitting a sliding window axis without overlap suits scenarios in which different computing resources must synchronize data frequently, for example multi-die parallelism, where the splicing function serves as the data synchronization node between dies. Overlapping data is then neither computed repeatedly nor allowed to keep growing, which effectively relieves the computing pressure and storage pressure on the computing resources.
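The non-overlapping split of a sliding window axis described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the helper names `split_disjoint`, `add_halo`, and `window_sum` are hypothetical, tensors are modelled as flat Python lists, the overlap is borrowed from the right-hand neighbour, and every piece is assumed to hold at least as many elements as the overlap. In this reading, `split_disjoint` stands in for the split function, the slice taken from the neighbouring piece stands in for a fourth input tensor, and list concatenation stands in for the splicing function.

```python
def split_disjoint(xs, k):
    """Split xs into k contiguous, pairwise-disjoint pieces whose union is xs."""
    base, extra = divmod(len(xs), k)
    pieces, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        pieces.append(xs[start:start + size])
        start += size
    return pieces


def add_halo(pieces, halo):
    """Slice `halo` leading elements from each right-hand neighbour and splice them on.

    Assumption for this sketch: every piece has at least `halo` elements.
    """
    out = []
    for i, piece in enumerate(pieces):
        borrowed = pieces[i + 1][:halo] if i + 1 < len(pieces) else []
        out.append(piece + borrowed)  # concatenation plays the role of the splicing function
    return out


def window_sum(xs, window):
    """A toy sliding-window operator: sum over each window of the given width."""
    return [sum(xs[i:i + window]) for i in range(len(xs) - window + 1)]


data = list(range(1, 9))                 # the first input tensor: 1..8
window = 3
pieces = split_disjoint(data, 2)         # disjoint pieces, no shared elements
inputs = add_halo(pieces, window - 1)    # contiguous inputs, one per computing resource
partial = [window_sum(p, window) for p in inputs]
result = [y for part in partial for y in part]
```

Under these assumptions, concatenating the per-resource outputs reproduces the output of the unsplit operator, and the only data exchanged between resources is the borrowed boundary slice.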
In a possible implementation, when the axis type of the target split axis in the first operator is a reduction axis, the processor is specifically configured to: according to the number K of target computing resources, call a third split function to split each of the Q first input tensors, obtaining Q groups of second input tensors.
Here, the elements on the target split axis of each second input tensor in the q-th group of second input tensors are a subset of the elements on the target split axis of the q-th of the Q first input tensors; the elements on the target split axis of the second input tensors in the q-th group are pairwise disjoint; and their union is the set of elements on the target split axis of the q-th first input tensor.
In this embodiment of the application, because the type of a reduction axis already determines the specific way of splitting it, the graph optimizer can, during graph optimization, reasonably split the input tensors of a specific operator that contains a reduction axis without relying on the principle of that operator. Conventional operator splitting, by contrast, always starts from the output tensor of the specific operator. Since a reduction axis either does not appear in the output tensor or has length 1 there, conventional splitting cannot split an axis of the input tensor that has the characteristics of a reduction axis.
In a possible implementation, reduction axes include a first type of reduction axis and a second type of reduction axis. A first-type reduction axis is a reduction axis along which the operator performs a reducing operation on the elements of its input tensor; a second-type reduction axis is a reduction axis along which the operator performs no reducing operation on the elements of its input tensor.
In a possible implementation, the first type of reduction axis includes any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, and a reduce-mean axis. A reduce-sum axis is a reduction axis along which the operator reduces the elements of its input tensor by summation; a reduce-max axis is one along which the operator reduces them by taking the maximum; a reduce-min axis is one along which the operator reduces them by taking the minimum; and a reduce-mean axis is one along which the operator reduces them by taking the average.
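The four first-type reduction axes can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the splitting functions of this application: the names `split_even` and `reduce_split` are hypothetical, the tensor is a flat list, and every piece is assumed to be non-empty. The point it shows is that each resource can reduce its own disjoint piece and the partial results can then be combined; for the mean, the combination must carry the element count of each piece.

```python
def split_even(xs, k):
    """Split xs into k contiguous, pairwise-disjoint pieces (assumes len(xs) >= k)."""
    base, extra = divmod(len(xs), k)
    pieces, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        pieces.append(xs[start:start + size])
        start += size
    return pieces


def reduce_split(kind, xs, k):
    """Reduce xs along its only axis by reducing k pieces separately, then combining."""
    pieces = split_even(xs, k)
    if kind == "sum":
        return sum(sum(p) for p in pieces)     # sum of partial sums
    if kind == "max":
        return max(max(p) for p in pieces)     # max of partial maxima
    if kind == "min":
        return min(min(p) for p in pieces)     # min of partial minima
    if kind == "mean":
        total = sum(sum(p) for p in pieces)    # partial sums ...
        count = sum(len(p) for p in pieces)    # ... and partial counts
        return total / count
    raise ValueError(f"unknown reduction kind: {kind}")


data = [4, 1, 7, 3, 9, 2, 6]
results = {kind: reduce_split(kind, data, 3) for kind in ("sum", "max", "min", "mean")}
```

Averaging the per-piece means directly would be wrong whenever the pieces differ in length, which is why this sketch combines sums and counts instead.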
In a possible implementation, the second type of reduction axis includes a reduce-gather axis. A reduce-gather axis is an axis along which the operator indexes data from the elements of its input tensor according to the addresses indicated by the elements of its index input tensor.
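A gather along a split axis can be sketched as follows. This is only one plausible scheme, given as an assumption for illustration; this passage does not fix how the partial results are merged. The names `split_with_offsets`, `gather_piece`, and `gather_split` are hypothetical, the data tensor is a flat list, and out-of-range lookups are filled with 0 so that the partial results can be merged by element-wise addition.

```python
def split_with_offsets(table, k):
    """Split table into k contiguous pieces, remembering where each piece starts."""
    base, extra = divmod(len(table), k)
    pieces, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        pieces.append((start, table[start:start + size]))
        start += size
    return pieces


def gather_piece(offset, piece, indices, fill=0):
    """Look up only the indices that fall inside this piece; fill the rest."""
    return [
        piece[i - offset] if offset <= i < offset + len(piece) else fill
        for i in indices
    ]


def gather_split(table, indices, k):
    """Gather from k pieces independently, then merge by element-wise addition."""
    partials = [gather_piece(off, piece, indices) for off, piece in split_with_offsets(table, k)]
    return [sum(column) for column in zip(*partials)]


table = [10, 20, 30, 40, 50]     # data input tensor
indices = [4, 0, 2, 2]           # index input tensor
gathered = gather_split(table, indices, 2)
```

Because each index falls inside exactly one piece, exactly one partial result contributes a non-fill value at each output position.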
In a possible implementation, the target computing resource includes one of the following kinds: a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
In a possible implementation, the apparatus may further include a memory in which instructions are stored. The processor is configured to execute the instructions stored in the memory and, when the instructions are executed, to perform the method in any implementation of the first aspect.
In a third aspect, a computer-readable medium is provided. The computer-readable medium stores program code, and the program code includes instructions for performing the method in any implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of this application;
FIG. 2 is a schematic diagram of operator splitting provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a method for processing a computing task provided by an embodiment of this application;
FIG. 4 is a schematic flowchart of an operator splitting method for a computing task completed by a single operator, provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of another method for processing a computing task provided by an embodiment of this application;
FIG. 6 is a schematic flowchart of an operator splitting method for a computing task completed by multiple operators, provided by an embodiment of this application;
FIG. 7 is a schematic diagram of a way of splitting an element axis provided by an embodiment of this application;
FIG. 8 is a schematic diagram of a way of splitting a reduce-sum axis provided by an embodiment of this application;
FIG. 9 is a schematic diagram of a way of splitting a reduce-max axis provided by an embodiment of this application;
FIG. 10 is a schematic diagram of a way of splitting a reduce-mean axis provided by an embodiment of this application;
FIG. 11 is a schematic diagram of a way of splitting a reduce-gather axis provided by an embodiment of this application;
FIG. 12 is a schematic diagram of a way of splitting a sliding window axis provided by an embodiment of this application;
FIG. 13 is a schematic diagram of another way of splitting a sliding window axis provided by an embodiment of this application;
FIG. 14 is a schematic diagram of the position information of an operator's splittable axes in the operator's input tensors and output tensors, provided by an embodiment of this application;
FIG. 15 is a schematic diagram of a specific application of operator splitting provided by an embodiment of this application;
FIG. 16 is a schematic diagram of another specific application of operator splitting provided by an embodiment of this application;
FIG. 17 is a schematic diagram of yet another specific application of operator splitting provided by an embodiment of this application;
FIG. 18 is a schematic diagram of an operator tensor structure provided by an embodiment of this application;
FIG. 19 is a schematic diagram of an apparatus for processing a computing task provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
The terms used in the following embodiments are only for describing particular embodiments and are not intended to limit this application. As used in the specification and the appended claims of this application, the singular forms "a", "an", "said", "the above", "the", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that, in the following embodiments of this application, "at least one" and "one or more" mean one, two, or more than two. The term "and/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, that both A and B exist, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects.
Reference in this specification to "one embodiment", "some embodiments", or the like means that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of this application. Thus the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in various places in this specification do not necessarily all refer to the same embodiment; rather, they mean "one or more, but not all, embodiments", unless specifically emphasized otherwise. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
To facilitate understanding of the technical solutions of this application, the concepts involved in this application are first briefly introduced.
(1) Deep learning model
A deep learning model is a machine learning model that contains a deep neural network structure. After an algorithm engineer builds a model with a deep learning framework and tunes, trains, and optimizes it, the resulting network parameters and model structure are saved together; the resulting file is a model file that can be used for forward inference.
Model files produced by different deep learning frameworks differ in format, but a complete model file generally contains information such as tensor data, operation units, and the computation graph.
(2) Tensor
A tensor is the data container of a deep learning system and can be understood as the extension of a matrix to an arbitrary number of dimensions. A tensor containing a single number is called a scalar, scalar tensor, zero-dimensional tensor, or 0D tensor; an array of numbers is called a vector, one-dimensional tensor, or 1D tensor; an array of vectors is called a matrix, two-dimensional tensor, or 2D tensor; combining several matrices into a new array yields a three-dimensional tensor, which can be pictured as a cube of numbers; combining several three-dimensional tensors into an array creates a four-dimensional tensor; and so on. Deep learning generally handles tensors from 0D to 4D, although 5D tensors may be encountered when processing video data.
The shape of a tensor gives the number of elements along each of its dimensions. For example, [[[1,2,3]],[[7,8,9]]] is a three-dimensional tensor of shape (2,1,3): its axis 0 has size 2, its axis 1 has size 1, and its axis 2 has size 3. As another example, FIG. 18 is a schematic diagram of a tensor provided by an embodiment of this application. The tensor shown in FIG. 18 has shape (4,20,20,3). Assuming that this tensor represents a feature map, the entries of the shape have the following physical meanings, from left to right: the batch size N of the feature map is 4, that is, 4 images; the height H of the feature map is 20 and its width W is 20, that is, each image has 20*20=400 pixels; and the number of channels of the feature map is 3, that is, the RGB channels.
The axes of a tensor are defined relative to its shape: an axis is an index into the shape. For example, [[[1,2],[3,4]],[[5,6],[7,8]]] is a three-dimensional tensor of shape (2,2,2). Axis 0 refers to the data of the first dimension, the two matrices [[1,2],[3,4]] and [[5,6],[7,8]]; axis 1 refers to the data of the second dimension, [1,2], [3,4], [5,6], and [7,8]; and axis 2 refers to the data of the third dimension, 1, 2, 3, 4, 5, 6, 7, and 8. As another example, for the tensor of shape (4,20,20,3) shown in FIG. 18, axis 0 carries the batch size of the feature map, axis 1 its height, axis 2 its width, and axis 3 its channels.
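The notions of shape and axis can be checked with a few lines of Python. The sketch below models tensors as regular, non-empty nested lists, which is an assumption made purely for illustration; the helper name `shape` is hypothetical and is not part of this application.

```python
def shape(tensor):
    """Return the shape of a regular, non-empty nested-list tensor."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))   # size of the current axis
        tensor = tensor[0]         # descend one axis
    return tuple(dims)


a = [[[1, 2, 3]], [[7, 8, 9]]]
b = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

along_axis0 = list(b)                                           # the two matrices
along_axis1 = [row for matrix in b for row in matrix]           # the four rows
along_axis2 = [x for matrix in b for row in matrix for x in row]  # the eight scalars
```

Here `shape(a)` evaluates to `(2, 1, 3)` and `shape(b)` to `(2, 2, 2)`, matching the two examples in the text.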
(3) Operator
An operator (operation/operator), also called an operation unit, computing unit, or operation symbol, represents a symbolic computation and is the basic unit of mainstream deep learning frameworks, that is, a node in the graph. The inputs and outputs of an operation unit are tensors. All transformations learned by a deep network can be reduced to a few tensor operations on tensors of numerical data.
Common operation units include the add unit, batch normalization (BatchNormalization) unit, convolution unit, gated recurrent unit (GRU), local response normalization (LRN) unit, long short-term memory (LSTM) unit, max pooling (max pool) unit, rectified linear unit (ReLU) activation function, recurrent neural network (RNN) unit, and Softmax function.
(4) Computation graph
A computation graph, also called a data flow graph, is defined as a directed acyclic graph (DAG). Tensors and operation units are both objects in the graph: operation units are the nodes of the graph, and tensors are the data flowing along its edges. Acyclic means that the graph cannot contain cycles; for example, a tensor x cannot be an input to one of the layers that generates x. The only processing loops (that is, recurrent connections) allowed are the internal loops of recurrent layers.
Most deep learning frameworks can be described with a directed acyclic graph in which each node represents a neuron; if the output of one node serves as the input of another node, the two nodes share an edge. In other words, the nodes in the computation graph represent operators, and an edge between two nodes represents a data dependency between them.
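The graph structure described above can be made concrete with a minimal Python sketch using the standard-library `graphlib` module (Python 3.9 or later). The operator names in `model` and the helper names `execution_order` and `is_acyclic` are hypothetical and chosen only for illustration; the dictionary maps each node to the set of nodes whose outputs it consumes.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(graph):
    """Return the nodes in an order that respects every data dependency."""
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    """A valid computation graph must not contain a cycle."""
    try:
        execution_order(graph)
    except CycleError:
        return False
    return True


# node -> set of predecessor nodes (the producers of its input tensors)
model = {"conv": {"input"}, "relu": {"conv"}, "add": {"relu", "input"}}
order = execution_order(model)

# A tensor x that feeds the layer producing x would form a cycle.
cyclic = {"x": {"layer"}, "layer": {"x"}}
```

For this small graph the only valid order is input, conv, relu, add, and `is_acyclic(cyclic)` is `False`, reflecting the rule that a tensor cannot be an input to a layer that generates it.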
(5) Operator splitting
Operator splitting is the splitting of an operator's input tensors and output tensors.
FIG. 1 is a schematic diagram of a deep learning compiler architecture provided by an embodiment of this application. The deep learning compiler is briefly introduced below with reference to FIG. 1.
A deep learning compiler can be divided into a compiler front end, a compiler middle end, and a compiler back end. The front end interfaces with the application layer, that is, with the deep learning model, and includes a parser among other components. The parser mainly converts models trained under different frameworks into an internal format that the hardware can recognize, for example converting the computation graph of a framework such as TensorFlow or Caffe2 into a computation graph in the internal format. The middle end includes a graph optimizer (also called a graph optimization module), operator information, and so on; it assigns different computing tasks to different computing resources (for example, a CPU or a GPU) for subsequent model execution. The back end mainly generates, automatically, code instructions matched to different hardware, and includes an operator compiler, an operator library, and so on.
A deep learning compiler usually improves a model's running performance on different devices at two levels: graph optimization and operator optimization. The two are relatively decoupled and independent. Graph optimization is a general optimization strategy, that is, one independent of the specific operator type, whereas operator optimization is an operator-specific optimization strategy, that is, one tied to the specific operator type.
Typical operator optimization strategies are compute optimization and schedule optimization: through manual or automatic tuning, a single specific operator is optimized to the utmost on a specific hardware platform. For example, for the general matrix-matrix multiplication (GEMM) operator, manual scheduling techniques such as blocking, vectorization, loop permutation, data packing, and multi-core multi-thread parallelism are typically used to optimize its schedule, so that the GEMM operator achieves a performance gain of tens of times on a CPU.
A typical graph optimization strategy is constant folding: if all input tensors that an operator depends on are constants, the operator nodes that do not depend on the model's run-time data can be computed in advance at compile time, saving overhead when the model runs.
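Constant folding as described above can be sketched in a few lines of Python. The graph encoding used here is an assumption made for the example and is not an internal format of any particular compiler: each node is a tuple `("const", value)`, `("input",)`, or `(op, left, right)`, and the dictionary is assumed to list nodes in topological order. The names `OPS` and `fold_constants` are hypothetical.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(graph):
    """Replace every operator node whose inputs are all constants by a constant node."""
    folded = {}
    for name, node in graph.items():          # assumes topological insertion order
        kind = node[0]
        if kind in OPS:
            args = [folded[arg] for arg in node[1:]]
            if all(arg[0] == "const" for arg in args):
                value = OPS[kind](*(arg[1] for arg in args))
                folded[name] = ("const", value)   # computed at compile time
                continue
        folded[name] = node                    # left for run time
    return folded


graph = {
    "two": ("const", 2),
    "three": ("const", 3),
    "six": ("mul", "two", "three"),   # depends only on constants
    "x": ("input",),                  # known only at run time
    "y": ("add", "x", "six"),
}
folded = fold_constants(graph)
```

After folding, node `six` has become `("const", 6)`, while node `y` is unchanged because it still depends on the run-time input `x`.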
Many other graph optimization strategies currently exist, for example graph splitting with execution-order optimization, multi-die parallelism, multi-thread parallelism, and multi-chip parallelism. All of these strategies need to be based on the principle of the operator itself; otherwise, the parallel optimization strategy cannot be expressed in the computation graph.
For example, graph splitting with execution-order optimization is a graph optimization strategy that reduces the memory bound on operator execution. Specifically, the outer-loop iteration variables of the operators are split uniformly, and the subsequent execution order of the operators is adjusted accordingly, so that the operators can perform a large number of iterations locally. This lowers the operators' demand for memory and lets more of the intermediate results produced by local computation be stored in the L2 cache, which reduces the memory bound on subsequent execution and ultimately improves the running performance of the whole network model. The way operators are split is therefore particularly important.
As another example, in multi-die parallelism, a die is a chip before packaging, and a chip uses advanced packaging technology to aggregate computing power. To make full use of the chip's performance, one operator can be split across multiple dies for computation while keeping data exchange between dies to a minimum. The way operators are split is therefore particularly important.
As yet another example, multi-thread parallelism treats a subgraph, that is, a subgraph containing different operators, as a basic operation unit. When a subgraph is assigned to multiple threads, or more generally to different computing resources such as different CPUs, for parallel execution, the operators in the subgraph must be split. Because one data synchronization between threads marks the end of a subgraph's execution, interaction between threads should also be minimized during parallel execution. Splitting the operators in the subgraph is therefore also particularly important.
FIG. 2 is a schematic diagram of operator splitting provided by an embodiment of this application. The tensors of one operator can be split into different slices, and different slices can run on different threads, different dies, or different chips. For example, as shown in FIG. 2, the input tensor of operator 1 is split into slice 1 and slice 2 of operator 1, and the input tensor of operator 2 is split into slice 1 and slice 2 of operator 2. Slice 1 of operator 1 runs on computing resource 1, and its result is passed to computing resource 2, where slice 1 of operator 2 runs; meanwhile, slice 2 of operator 1 runs on computing resource 3, and its result is passed to computing resource 4, where slice 2 of operator 2 runs. Finally, the results on computing resource 2 and computing resource 4 are spliced.
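The flow of FIG. 2 can be imitated with a small Python sketch. It simplifies the figure under stated assumptions: the two operators are hypothetical element-wise functions (`operator1` adds 1, `operator2` doubles), tensors are flat lists, and each slice's chain of operator 1 followed by operator 2 runs on a single worker thread rather than on two separate computing resources.

```python
from concurrent.futures import ThreadPoolExecutor


def operator1(xs):
    """Hypothetical element-wise operator 1."""
    return [x + 1 for x in xs]


def operator2(xs):
    """Hypothetical element-wise operator 2."""
    return [x * 2 for x in xs]


def run_slice(xs):
    """Run operator 1 and then operator 2 on one slice."""
    return operator2(operator1(xs))


data = [1, 2, 3, 4, 5, 6]
slices = [data[:3], data[3:]]                 # slice 1 and slice 2

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(run_slice, slices))  # map preserves slice order

spliced = parts[0] + parts[1]                 # final splicing step
```

Because both operators act element by element, the spliced result equals the result of running the two operators on the unsplit tensor; operators with reduction or sliding window axes need the more careful splitting discussed elsewhere in this application.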
Most current graph optimization schemes that involve operator splitting need to be based on the principle of the operator itself; for example, graph optimization needs to split operators according to the properties of their loop iteration variables. At the same time, operator optimization and graph optimization need to remain relatively decoupled architecturally. One current approach is therefore for algorithm engineers to manually classify the loop variables of the necessary operators and summarize how each class of loop variable changes after axis splitting, which assists the efficient generation of graph optimization strategies. Efficient generation of graph optimization strategies thus still depends on manual classification of the loop variables of the necessary operators, which does not allow arbitrary operators to be split and run automatically.
Another existing approach splits an operator's input tensors at the application layer based on the splittable axes of the operator's output (for example, the sample axis, parameter axis, and attribute axis), so that operator splitting enables parallel computation on multiple GPUs.
Specifically, the sample axis, parameter axis, and attribute axis are three kinds of splittable axes on the operator output. Splitting along the sample axis splits the operator's input tensor in the sample dimension, and the resulting pieces are distributed to different computing resources for data parallelism. Splitting along the parameter axis splits the operator's input tensor in the parameter dimension, and the resulting pieces are distributed to different computing resources for model parallelism. The attribute axis is any axis of the operator output other than the sample axis and the parameter axis; splitting along it splits the samples of the operator's input tensor in the attribute dimension.
With these three kinds of axes, the operator's input tensors can be split across different computing resources for computation, either along one kind of axis alone or along a combination of them, achieving parallel computation on multiple computing resources. Although this approach achieves a degree of automatic operator splitting at the application layer, it still has limitations. First, the three kinds of axes are currently defined, from the axes of the output tensor, only for the matrix multiplication operator, so they cannot cover all splittable axes and splitting methods of operators. Second, the three kinds of axes are determined by the types of the axes in the operator's output tensor; an axis that is absent from the output tensor is never split, whatever splittable axes the input tensors actually have. Splitting of the input tensors is therefore coarse, and operators cannot be split precisely before being distributed to different computing resources. Finally, the approach still defines the split axes and splitting methods at the application layer: an algorithm engineer uses a scripting language at the application layer to decide the splitting method from the split axes included in a given operator type. Automatic splitting of the inputs and outputs of different operators is therefore still not possible, nor is complete decoupling of graph optimization and operator optimization.
为了解决上述问题,本申请实施例提出了一种处理计算任务方法和装置,下面将结合图3至图19对此进行详细描述。In order to solve the above problems, the embodiment of the present application proposes a method and an apparatus for processing computing tasks, which will be described in detail below with reference to FIGS. 3 to 19 .
图3是本申请实施例提供的一种处理计算任务方法流程示意图。Fig. 3 is a schematic flowchart of a method for processing computing tasks provided by an embodiment of the present application.
S301,确定用于执行计算任务的第一算子,第一算子包括N个可切分轴,N为大于或等于1的正整数。S301. Determine a first operator for performing a computing task, where the first operator includes N divisible axes, where N is a positive integer greater than or equal to 1.
应理解,第一算子包括的N个可切分轴表示第一算子的输入张量中包括N个可切分轴。It should be understood that the N splittable axes included in the first operator means that the input tensor of the first operator includes N splittable axes.
S302: Obtain splitting information of the first operator from an operator splitting information library. The splitting information of the first operator includes the axis type, in the first operator, of the n-th splittable axis among the N splittable axes, and first position information, where the first position information indicates the position of the n-th splittable axis in the input tensors of the first operator, and n = 1, …, N.
That is, the splitting information of the first operator indicates that each of the N splittable axes has its own axis type in the first operator. The different axis types and their corresponding splitting modes are described in detail later with reference to FIG. 7 to FIG. 17. The splitting information also indicates in which input tensors of the first operator each splittable axis appears, and on which axis of each of those input tensors it appears. For example, from the position information of splittable axis 1 in the first operator, it can be determined that splittable axis 1 appears in input tensors 1 and 2 of the first operator, on axis 0 of input tensor 1 and on axis 0 of input tensor 2.
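The splitting information described above can be pictured as one small record per splittable axis. The following is a minimal sketch under stated assumptions: the names AxisInfo, SPLIT_INFO_LIBRARY, get_split_info and the operator name "add" are hypothetical and do not come from the application; only the content of each record (an axis type plus the position of the axis in each input tensor) follows the text.

```python
from dataclasses import dataclass


@dataclass
class AxisInfo:
    """One splittable axis of an operator (hypothetical record layout)."""
    axis_type: str   # e.g. "elementwise", "reduce", "sliding_window"
    positions: dict  # input tensor number -> axis index inside that tensor


# Hypothetical library: operator name -> list of its splittable axes.
SPLIT_INFO_LIBRARY = {
    # Splittable axis 1 lies on axis 0 of input tensor 1 and of input tensor 2.
    "add": [AxisInfo("elementwise", {1: 0, 2: 0})],
}


def get_split_info(op_name):
    """Look up the splitting information without inspecting operator semantics."""
    return SPLIT_INFO_LIBRARY[op_name]


info = get_split_info("add")
axis1 = info[0]
print(axis1.axis_type, sorted(axis1.positions))
```

A lookup of this kind is all the graph optimizer would need in S302, which is consistent with the statement that it does not have to understand the mathematics of the operator.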
S303: Split the input tensors of the first operator according to the splitting information of the first operator to obtain K groups of input tensors, where K is a positive integer greater than or equal to 2.
It should be understood that the number of input tensors in each of the K groups is the same as the number of input tensors of the first operator.
In a possible implementation, the input tensors of the first operator are split according to the splitting information of the first operator and the number M of computing resources to obtain the K groups of input tensors, where M is a positive integer greater than or equal to 2.
Although M computing resources are available, the graph optimizer does not necessarily use all of them. For example, the number K of target computing resources may be estimated from the size of the computing task, or may be determined randomly. This is not limited in the embodiments of the present application.
It should also be understood that each of the K groups of input tensors is the set of input tensors required by one target computing resource. For example, if a single computing resource needs a total of a input tensors to complete the computing task before the input tensors of the first operator are split, then after the split each target computing resource likewise needs a total of a input tensors.
In a possible implementation, a target split axis is determined according to the splitting information of the first operator, and the input tensors of the first operator are split along the target split axis to obtain the K groups of input tensors. The operator splitting procedure for a computing task completed by a single operator is described later with reference to FIG. 4.
It should be noted that splitting the input tensors of the first operator does not mean splitting all of its input tensors. Only the input tensors that include the target split axis are split; input tensors that do not include the target split axis are sent to every target computing resource as shared input data.
In a possible implementation, if a second operator is also needed to execute the computing task, a space of candidate splitting modes is determined according to the splitting information of the first operator, the splitting information of the second operator, and the number M of computing resources. A target splitting mode is then determined from the space of candidate splitting modes, and the input tensors of the first operator are split according to the target splitting mode to obtain the K groups of input tensors. The splitting procedure for a computing task completed by multiple operators is described later with reference to FIG. 5.
S304: Send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be understood that the number K of target computing resources is determined according to the number M of computing resources.
In the embodiments of the present application, the graph optimizer obtains the splitting information of an operator from the operator splitting information library. Because the splitting information of every operator can be obtained directly from the library, the graph optimizer can split the input tensors of an operator automatically without any knowledge of the operator's mathematical semantics or underlying implementation, thereby completely decoupling graph optimization from operator optimization.
FIG. 4 is a schematic flowchart of an operator splitting procedure for a computing task completed by a single operator, provided by an embodiment of the present application. FIG. 4 details one possible implementation of S303.
S401: Determine a target split axis, where the target split axis is one of the N splittable axes.
In a possible implementation, the graph optimizer randomly selects one splittable axis as the target split axis. For example, the first axis of the input tensor of the first operator is used as the target split axis, and this first axis may be the batch axis.
In a possible implementation, the graph optimizer selects, as the target split axis, the splittable axis that is common to the largest number of input tensors of the first operator. For example, if the first operator has three input tensors, splittable axis 1 appears in all three of them, and splittable axis 2 appears in two of them, then splittable axis 1 may be used as the target split axis.
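The selection rule in the preceding paragraph can be sketched as follows. This is an illustrative sketch only: the function name pick_most_shared_axis and the dictionary layout are assumptions, and ties are resolved by keeping the first axis encountered, which the application does not specify.

```python
def pick_most_shared_axis(positions_by_axis):
    """Return the splittable axis that appears in the most input tensors.

    positions_by_axis maps an axis name to {input tensor index: axis index}.
    On a tie, max() keeps the first axis in iteration order (an assumption).
    """
    return max(positions_by_axis, key=lambda name: len(positions_by_axis[name]))


positions_by_axis = {
    "axis1": {0: 0, 1: 0, 2: 1},  # appears in 3 input tensors
    "axis2": {0: 1, 1: 1},        # appears in 2 input tensors
}
target = pick_most_shared_axis(positions_by_axis)
print(target)
```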
In a possible implementation, the computation time required to complete the task under the splitting mode corresponding to each splittable axis is evaluated, and the splittable axis with the shortest computation time is selected as the target split axis.
In a possible implementation, the target split axis is determined according to both the computation time required to complete the computing task under the splitting mode corresponding to each splittable axis and the number K of target computing resources. For example, suppose that completing the computing task with the splitting mode given by splittable axis 1 and b target computing resources takes the same computation time as with the splitting mode given by splittable axis 2 and c target computing resources, but b is less than c. In that case, splittable axis 1 is selected as the target split axis, and the corresponding number of target computing resources is b.
S402: Determine, according to the splitting information of the first operator, the splitting mode corresponding to the axis type of the target split axis in the first operator.
S403: Split the input tensors of the first operator according to the splitting mode corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors.
In a possible implementation, according to the splitting mode corresponding to the axis type of the target split axis in the first operator, Q first input tensors of the first operator that include the target split axis are determined, together with the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.
It should be understood that the Q first input tensors are the input tensors of the first operator that include the target split axis.
It should be understood that the position of the target split axis in each of the Q first input tensors indicates which axis of that first input tensor the target split axis lies on. For example, the target split axis lies on axis 0 of the 1st first input tensor and on axis 0 of the 2nd first input tensor.
In a possible implementation, each of the Q first input tensors is split according to the axis type of the target split axis in the first operator and the number K of target computing resources, to obtain Q groups of second input tensors, where each of the Q groups includes K second input tensors.
It should be understood that the q-th group of second input tensors is the result of splitting the q-th first input tensor into K pieces, where q = 1, …, Q.
It should be understood that, when the number of target computing resources is K, every first input tensor that includes the target split axis is split along the target split axis into K second input tensors, which serve as input tensors of the K target computing resources respectively. Therefore, when Q first input tensors include the target split axis, Q groups of second input tensors are formed.
It should be understood that the axis type of the target split axis may be an elementwise axis, a sliding window axis, or a reduce axis. The splitting modes for these axis types are described in detail below with reference to FIG. 7 to FIG. 13.
In a possible implementation, the K groups of input tensors are obtained from the Q groups of second input tensors and the unsplit input tensors of the first operator.
The k-th group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
It should be understood that each of the K groups of input tensors includes the unsplit input tensors of the first operator, which serve as shared data, and the second input tensors obtained by splitting for the corresponding target computing resource.
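The assembly of the K groups from the Q groups of second input tensors and the unsplit tensors can be sketched as follows. This is a minimal sketch under assumptions: tensors are plain nested lists, the helper names axis_length, split_along and build_groups are hypothetical, and pieces are made as even as possible, which the application does not require.

```python
def axis_length(tensor, axis):
    """Length of a nested-list tensor along the given axis."""
    for _ in range(axis):
        tensor = tensor[0]
    return len(tensor)


def split_along(tensor, axis, sizes):
    """Split a nested-list tensor into len(sizes) pieces along one axis."""
    if axis == 0:
        pieces, start = [], 0
        for size in sizes:
            pieces.append(tensor[start:start + size])
            start += size
        return pieces
    # Split every sub-tensor one level down, then regroup piece by piece.
    per_sub = [split_along(sub, axis - 1, sizes) for sub in tensor]
    return [[parts[i] for parts in per_sub] for i in range(len(sizes))]


def build_groups(inputs, positions, k):
    """Build K groups of input tensors.

    inputs:    name -> tensor (all input tensors of the operator)
    positions: name -> index of the target split axis, given only for the
               Q tensors that contain the target split axis
    """
    groups = [dict() for _ in range(k)]
    for name, tensor in inputs.items():
        if name in positions:
            axis = positions[name]
            base, rem = divmod(axis_length(tensor, axis), k)
            sizes = [base + (1 if i < rem else 0) for i in range(k)]
            for group, piece in zip(groups, split_along(tensor, axis, sizes)):
                group[name] = piece          # k-th second input tensor
        else:
            for group in groups:
                group[name] = tensor         # shared, unsplit input tensor
    return groups


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # shape (4, 2), target axis is axis 0
y = [[1, 2, 3, 4], [5, 6, 7, 8]]      # shape (2, 4), target axis is axis 1
w = [10, 20]                          # does not contain the target axis
groups = build_groups({"x": x, "y": y, "w": w}, {"x": 0, "y": 1}, k=2)
print(groups[0])
print(groups[1])
```

In this example Q is 2 (x and y contain the target split axis) and K is 2, so each group holds three tensors, the same number as the operator had before splitting.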
FIG. 5 is a schematic flowchart of another method for processing a computing task provided by an embodiment of the present application. FIG. 5 details another possible implementation of S303.
When the operators for executing the computing task further include a second operator, the second operator includes P splittable axes, and the P splittable axes are a subset of the N splittable axes.
S501: Obtain splitting information of the second operator from the operator splitting information library. The splitting information of the second operator includes the axis type, in the second operator, of the p-th splittable axis among the P splittable axes, and second position information, where the second position information indicates the position of the p-th splittable axis in the input tensor of the second operator, the input tensor of the second operator is the output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P.
It should be understood that the P splittable axes being a subset of the N splittable axes means that the P splittable axes appear in the output tensor of the first operator, which serves as the input tensor of the second operator. In other words, the P splittable axes of the second operator also appear among the N splittable axes of the first operator.
S502: Determine P pieces of split reference information according to the splitting information of the first operator and the splitting information of the second operator. The p-th piece of split reference information includes the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensors of the first operator.
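One way to picture S502 is as a join of the two operators' splitting information over the axes they share. The sketch below is an assumption-laden illustration: the function name build_split_references, the axis names "batch" and "k", and the tuple layout (axis type, positions) are all hypothetical; the three fields of each resulting record are the ones listed in the text.

```python
def build_split_references(first_info, second_info):
    """Combine splitting information of two chained operators.

    Each info maps an axis name to (axis_type, positions), where positions
    maps an input tensor index to the axis index inside that tensor.
    The axes of the second operator are assumed to be a subset of the
    axes of the first operator, as stated in the text.
    """
    refs = {}
    for axis, (type_in_second, _unused_positions) in second_info.items():
        type_in_first, pos_in_first = first_info[axis]
        refs[axis] = {
            "type_in_first": type_in_first,
            "type_in_second": type_in_second,
            "pos_in_first_inputs": pos_in_first,
        }
    return refs


# N = 2 splittable axes in the first operator, P = 1 in the second.
first_info = {
    "batch": ("elementwise", {0: 0}),
    "k": ("reduce", {0: 1, 1: 0}),
}
second_info = {"batch": ("elementwise", {0: 0})}

refs = build_split_references(first_info, second_info)
print(refs)
```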
S503: Determine P groups of candidate splitting modes according to the P pieces of split reference information and the number M of computing resources, where the p-th group of candidate splitting modes includes at least one splitting mode.
The splitting modes included in the p-th group of candidate splitting modes are determined according to the p-th piece of split reference information and the number M of computing resources.
It should be understood that each group of candidate splitting modes corresponds to one piece of split reference information, that is, to the split reference information of one of the P splittable axes. That each group includes at least one splitting mode can also be understood as each group including M-1 splitting modes. For example, when the number of computing resources is 4, the number of target computing resources may be 2, 3, or 4; there are three possible numbers of target computing resources, so each group of candidate splitting modes includes three splitting modes.
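The count of M-1 splitting modes per group follows from letting the number of target computing resources range from 2 to M. A minimal sketch of that enumeration is given below; the function name enumerate_candidates and the representation of a splitting mode as an (axis, K) pair are assumptions made for illustration.

```python
def enumerate_candidates(axes, m):
    """Return one group of candidate splitting modes per splittable axis.

    A splitting mode is modelled here as (axis, k), with k = 2, ..., m,
    so every group contains m - 1 modes.
    """
    return {axis: [(axis, k) for k in range(2, m + 1)] for axis in axes}


candidates = enumerate_candidates(["axis1", "axis2"], m=4)
for axis, modes in candidates.items():
    print(axis, modes)
```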
S504: Determine a target splitting mode according to the time each splitting mode in the P groups of candidate splitting modes needs to complete the computing task.
In a possible implementation, the splitting mode that needs the shortest time to complete the computing task among the P groups of candidate splitting modes is determined as the target splitting mode.
Specifically, when the total number of splitting modes in the P groups is small, the P groups of candidate splitting modes are traversed to obtain the time each candidate splitting mode needs to complete the computing task, and the splitting mode with the shortest time is selected as the target splitting mode. The traversal may be carried out by simulation, by theoretical calculation, or by running on actual hardware; the manner of traversal is not limited in the embodiments of the present application.
Specifically, when the total number of splitting modes in the P groups is large, the target splitting mode is found by searching the P groups of candidate splitting modes. Various search methods may be used, for example a Markov chain Monte Carlo algorithm or a genetic algorithm; the search method is not limited in the embodiments of the present application.
In a possible implementation, the target splitting mode is determined according to both the time each splitting mode in the P groups of candidate splitting modes needs to complete the computing task and the number K of target computing resources. For example, suppose that completing the computing task with splitting mode 1 and d target computing resources takes the same computation time as with splitting mode 2 and e target computing resources, but the number d of target computing resources for splitting mode 1 is less than the number e for splitting mode 2. In that case, splitting mode 1 is selected as the target splitting mode, and the number of target computing resources corresponding to the target splitting mode is d.
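The selection in S504, including the tie-break on the number of target computing resources, can be sketched as a minimum over (time, K) pairs. The sketch below assumes a small candidate set that can be traversed exhaustively; the function name pick_target and the timing values are invented for illustration and do not come from the application.

```python
def pick_target(candidates, cost):
    """Pick the splitting mode with the shortest time.

    candidates: iterable of (axis, k) splitting modes
    cost:       callable returning the estimated time of a mode
    Ties on time are broken in favour of fewer target computing resources.
    """
    return min(candidates, key=lambda mode: (cost(mode), mode[1]))


# Hypothetical times, e.g. obtained by simulation or on real hardware.
times = {
    ("axis1", 2): 5.0,
    ("axis1", 4): 3.0,
    ("axis2", 2): 3.0,
    ("axis2", 4): 4.0,
}
target = pick_target(times, times.get)
print(target)
```

Here ("axis1", 4) and ("axis2", 2) take the same time, so the mode that uses two target computing resources is chosen.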
S505: Split the input tensors of the first operator according to the target splitting mode to obtain the K groups of input tensors.
FIG. 6 is a schematic flowchart of an operator splitting procedure for a computing task completed by multiple operators, provided by an embodiment of the present application. S505 is described in detail below with reference to FIG. 6.
S601: Determine, according to the target splitting mode, the target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that include the target split axis, and the position of the target split axis in each of the Q first input tensors, where Q is a positive integer greater than or equal to 1.
It should be understood that the explanation of the Q first input tensors in S601 is similar to that in S402. For brevity, refer to the description of S402; details are not repeated here.
S602: Split each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the number K of target computing resources, to obtain Q groups of second input tensors. Each of the Q groups includes K second input tensors, and the q-th group of second input tensors is the result of splitting the q-th first input tensor into K pieces, where q = 1, …, Q.
It should be understood that the explanation of the Q groups of second input tensors in S602 is similar to that in S403. For brevity, refer to the description of S403; details are not repeated here.
It should be noted that, in S602, the Q groups of second input tensors are obtained based on both the axis type of the target split axis in the first operator and its axis type in the second operator. The specific splitting mode is illustrated by example with reference to FIG. 15.
S603: Obtain the K groups of input tensors from the Q groups of second input tensors and the unsplit input tensors of the first operator.
The k-th group of the K groups of input tensors includes the k-th second input tensor of each of the Q groups of second input tensors, together with the unsplit input tensors of the first operator.
It should be understood that the explanation of the K groups of second input tensors in S603 is similar to that in S404. For brevity, refer to the description of S404; details are not repeated here.
It should be noted that more operators may be needed to complete the computing task. The embodiments of the present application are described in detail using the example in which the computing task is completed by the first operator and the second operator. When operators other than the first and second operators are also needed, the graph optimizer further obtains the splitting information of those operators to derive the candidate splitting modes and thereby determine the target splitting mode.
The axis types of operator input tensors, the position information of splittable axes in the input and output tensors of operators, and the operator splitting modes corresponding to different axis types in the embodiments of the present application are described in detail below with reference to FIG. 7 to FIG. 17.
The axis type represents the data dependency between the input tensor and the output tensor of an operator; that is, the graph optimizer can determine the splitting mode from the axis type of the input tensor. Therefore, when the inputs of different operators include the same axis type, the same operator splitting mode can be used.
In a possible implementation, the axis types of operator input tensors may include splittable axes such as the elementwise axis, the reduce axis, and the sliding window axis, and may also include other types of splittable axes. This is not limited in the embodiments of the present application.
The elementwise axis, the reduce axis, and the sliding window axis are described in detail below with reference to FIG. 7 to FIG. 13. It should be noted that FIG. 7 to FIG. 13 are all schematic diagrams of operator splitting modes for a computing task completed by a single operator. Operator A, operator B, and operator C in FIG. 7 to FIG. 13 may each represent the first operator; the name of the first operator is not limited in the embodiments of the present application.
Elementwise axis: if an iteration variable in the input tensor of operator A is an elementwise axis, then the elementwise axis is an axis along which the elements of the input tensor and the output tensor of operator A have a point-to-point mapping; that is, a point in the output tensor and the point in the input tensor on which it depends have the same position on this axis. For example, the input tensor is a four-dimensional tensor of shape (5,7,9,3), whose axis 3 has length 3 and holds data a0, a1, and a2; the output tensor has shape (4,6,8,3), whose axis 3 has length 3 and holds data b0, b1, and b2. If a0 corresponds in position to b0, a1 to b1, and a2 to b2, then axis 3 of the input tensor and of the output tensor is an elementwise axis.
FIG. 7 is a schematic diagram of a splitting mode for an elementwise axis provided by an embodiment of the present application. The steps for splitting the input tensor of operator A along the elementwise axis are shown in FIG. 7, in which operator A is an activation function operator by way of example. The type of operator A is not limited in the embodiments of the present application. It should be noted that the activation function operator in FIG. 7 is illustrated with a single input tensor and a single output tensor; the numbers of input tensors and output tensors of an operator are not limited in the embodiments of the present application.
Specifically, the target split axis in the activation function operator is an elementwise axis. From the position information of the target split axis in the activation function operator, it can be determined that the target split axis, of type elementwise axis, lies on axis 0 of the first input tensor of the activation function operator; that is, axis 0, of length 8, is the elementwise axis. From the length of the elementwise axis of the first input tensor, the length of the elementwise axis of the first output tensor is obtained through the forward shape derivation function y = f_1(x) of the elementwise axis, where x is the length of the elementwise axis of the first input tensor and y is the length of the elementwise axis of the first output tensor. The forward derivation logic of the elementwise axis is that the elementwise axes of the first input tensor and the first output tensor have equal lengths. As shown in (a) of FIG. 7, the first input tensor of the activation function operator is (8,56,56,64) and its axis 0 is the elementwise axis. Because the length of the elementwise axis of the output tensor equals that of the input tensor, axis 0 of the first output tensor also has length 8; that is, the first output tensor is (8,56,56,64).
According to the number of target computing resources, the first output tensor is split along the elementwise axis to obtain the second output tensor of the activation function operator on each target computing resource. Then, from the elementwise-axis length of the second output tensor of the operator on each computing resource, the elementwise-axis length of each second input tensor is derived in reverse through the reverse shape derivation function of the elementwise axis [Figure PCTCN2022074576-appb-000001]. A split function is needed to split the first input tensor along the elementwise axis, and after the different target computing resources finish computing, the second output tensors on them need to be joined by a concatenation (concat) function to obtain the first output tensor.
As shown in (b) of FIG. 7, two target computing resources are used for the activation function operator. Axis 0 of the first output tensor has length 8, so the elementwise axis of the second output tensor on each computing resource has length 4; that is, the 1st second output tensor is (4,56,56,64) and the 2nd second output tensor is (4,56,56,64). A concatenation function is used between the second output tensors and the first output tensor to synchronize data. The elements on axis 0 of the 1st second output tensor and those on axis 0 of the 2nd second output tensor are disjoint. Then, according to the reverse shape derivation function of the elementwise axis, the elementwise-axis length of each second input tensor is derived to be 4; that is, the 1st second input tensor is (4,56,56,64) and the 2nd second input tensor is (4,56,56,64). To obtain second input tensors of this shape, the graph optimizer calls a first split function to split the first input tensor along the elementwise axis, that is, along axis 0 of the first input tensor, obtaining two second input tensors.
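The shape bookkeeping of the FIG. 7 example can be reproduced with a few lines of code. The sketch below works on shapes only and rests on assumptions: the function names are hypothetical, the output is divided as evenly as possible, and the reverse derivation is taken to be the identity, which matches the lengths stated in the example but is not a transcription of the formula shown in the figure.

```python
def forward_elementwise(x_len):
    """Forward shape derivation y = f_1(x): lengths are equal."""
    return x_len


def reverse_elementwise(y_len):
    """Assumed reverse derivation for an elementwise axis: lengths are equal."""
    return y_len


def second_input_shapes(first_input_shape, axis, k):
    """Shapes of the K second input tensors for an elementwise split."""
    out_len = forward_elementwise(first_input_shape[axis])
    base, rem = divmod(out_len, k)
    out_lens = [base + (1 if i < rem else 0) for i in range(k)]
    shapes = []
    for y_len in out_lens:
        shape = list(first_input_shape)
        shape[axis] = reverse_elementwise(y_len)
        shapes.append(tuple(shape))
    return shapes


shapes = second_input_shapes((8, 56, 56, 64), axis=0, k=2)
print(shapes)
```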
It should be noted that in (b) of Figure 7 the two second output tensors have equal 0-axis lengths, as do the two second input tensors, but equality is not required. The second input tensors for different target computing resources only need to satisfy the following conditions: the elements on the 0 axis of the two second input tensors do not overlap, that is, each set of elements is a subset of the elements on the 0 axis of the first input tensor and the two subsets are disjoint; and the union of the elements on the 0 axis of the two second input tensors equals the elements on the 0 axis of the first input tensor. The embodiments of the present application do not limit whether the 0-axis lengths of the second input tensors obtained on different computing resources are equal.
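The element-axis split described above can be illustrated with a minimal sketch. It assumes a ReLU-style activation and a small (8, 3) tensor stored as nested lists; the helper names `split_axis0` and `concat_axis0` are hypothetical stand-ins for the split and concat functions and are not taken from the application.

```python
def relu(rows):
    """Element-wise activation applied to a list of rows (assumed operator)."""
    return [[max(0.0, v) for v in row] for row in rows]

def split_axis0(rows, sizes):
    """Split rows into consecutive, non-overlapping chunks of the given sizes."""
    assert sum(sizes) == len(rows)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(rows[start:start + size])
        start += size
    return chunks

def concat_axis0(chunks):
    """Splice chunks back together along axis 0."""
    return [row for chunk in chunks for row in chunk]

# First input tensor: 8 entries on axis 0 (the element axis), 3 values each.
first_input = [[float(i - 4), float(4 - i), float(i)] for i in range(8)]

reference = relu(first_input)  # result without any split
equal = concat_axis0([relu(c) for c in split_axis0(first_input, [4, 4])])
unequal = concat_axis0([relu(c) for c in split_axis0(first_input, [3, 5])])
```

Both the equal (4, 4) and the unequal (3, 5) split reproduce the unsplit result, because the chunks are disjoint and their union covers the whole 0 axis.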
Reduce axis: if an iteration variable in the input tensor of operator B is a reduce axis, then the reduce axis is an axis that is present in the operator's input tensor but is either absent from the operator's output tensor or has length 1 there.
Specifically, reduce axes fall into two types. The first type is a reduce axis along which operator B performs a reduction operation on the elements of the input tensor. For example, if the shape of the input tensor of operator B is (2,3,4,5) and its 0 axis, of length 2, is the reduce axis, then the output tensor obtained after operator B has shape (,3,4,5), with the 0 axis removed, or (1,3,4,5), with the 0 axis kept at length 1.
The second type is a reduce axis along which operator B does not perform a reduction operation on the elements of the input tensor. Although operator B does not reduce the elements on a second-type reduce axis, the axis likewise appears in the input tensor but not in the output tensor. An example is the reduce-gather axis, which is described in detail with reference to Figure 11.
The first type of reduce axis may include the reduce-sum (reduceSum) axis, the reduce-max (reduceMax) axis, the reduce-min (reduceMin) axis, the reduce-mean (reduceMean) axis, and the like. It should be noted that these different types all share the general properties of a reduce axis. They differ in the type of function that must be called, after the split first input tensor passes through operator B on the different target computing resources, to obtain a first output tensor equivalent to the one computed without splitting. The specific ways of splitting the different first-type reduce axes are described below with reference to Figures 8 to 10.
Figure 8 is a schematic diagram of a way of splitting a reduce-sum axis provided by an embodiment of the present application. The steps for splitting the reduce-sum axis in the first input tensor of operator B are shown in Figure 8, which takes operator B to be a reduce-sum operator as an example. The embodiments of the present application do not limit the type of operator B. It should be noted that the reduce-sum operator in Figure 8 is illustrated with a single input tensor and a single output tensor; the embodiments of the present application do not limit the number of input tensors or output tensors of the operator.
Specifically, the type of the target split axis in the reduce-sum operator is the reduce-sum axis. From the position information of the target split axis in the reduce-sum operator, it can be determined that the target split axis is a reduce-sum axis located on the 0 axis of the operator's first input tensor. As shown in (a) of Figure 8, the first input tensor of the reduce-sum operator is (8,56,56,64) and its 0 axis, of length 8, is the reduce-sum axis. By the property of the reduce-sum axis, the reduce-sum axis of the first output tensor has length 1, that is, the first output tensor is (,56,56,64).
According to the number of target computing resources and the length of the reduce-sum axis in the first output tensor, the third split function is called to split the first input tensor along the reduce-sum axis into two second input tensors. The two second input tensors are sent to the two target computing resources for computation, producing two second output tensors: the first second output tensor is (,56,56,64) and the second second output tensor is (,56,56,64). The add (AddN) function is then called to synchronize the data on the two target computing resources and obtain the first output tensor. As shown in (b) of Figure 8, two computing resources are available for the reduce-sum operator, and the reduce-sum axis of the first output tensor has length 1. The second output tensors of the operator on the computing resources are combined by the add operator into the first output tensor; each second output tensor has the same shape as the first output tensor, (,56,56,64). Because there are two computing resources, the split operator splits the reduce-sum axis of the first input tensor to obtain the second input tensors, each of which has a reduce-sum axis of length 4, that is, a shape of (4,56,56,64).
It should be noted that in (b) of Figure 8 the two second output tensors have equal 0-axis lengths, as do the two second input tensors, but equality is not required. The second input tensors for different target computing resources only need to satisfy the following conditions: the elements on the 0 axis of the two second input tensors do not overlap, that is, each set of elements is a subset of the elements on the 0 axis of the first input tensor and the two subsets are disjoint; and the union of the elements on the 0 axis of the two second input tensors equals the elements on the 0 axis of the first input tensor. The embodiments of the present application do not limit whether the 0-axis lengths of the second input tensors obtained on different computing resources are equal.
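As a rough illustration of the reduce-sum split, the sketch below sums a small (8, 3) tensor over its 0 axis, once directly and once as two partial sums merged by an element-wise add. The helper names `reduce_sum_axis0` and `add_n` are hypothetical; `add_n` only stands in for the AddN synchronization step.

```python
def reduce_sum_axis0(rows):
    """Sum over axis 0: a list of rows collapses to a single row."""
    return [sum(col) for col in zip(*rows)]

def add_n(tensors):
    """Element-wise sum of equally shaped rows (stand-in for AddN)."""
    return [sum(vals) for vals in zip(*tensors)]

# First input tensor of shape (8, 3); axis 0 is the reduce-sum axis.
first_input = [[i + 10 * j for j in range(3)] for i in range(8)]
reference = reduce_sum_axis0(first_input)        # computed without splitting

parts = [first_input[:4], first_input[4:]]       # two second input tensors
partials = [reduce_sum_axis0(p) for p in parts]  # two second output tensors
merged = add_n(partials)                         # first output tensor

# An unequal split works as well, since addition is associative.
uneven = add_n([reduce_sum_axis0(first_input[:3]),
                reduce_sum_axis0(first_input[3:])])
```

Each partial result already has the shape of the final output, which is why the merge step is an addition rather than a concatenation.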
Figure 9 is a schematic diagram of a way of splitting a reduce-max axis provided by an embodiment of the present application. The steps for splitting the reduce-max axis of the first input tensor of operator B are shown in Figure 9, which takes operator B to be a reduce-max operator as an example. The embodiments of the present application do not limit the type of operator B. It should be noted that the reduce-max operator in Figure 9 is illustrated with a single input tensor and a single output tensor; the embodiments of the present application do not limit the number of input tensors or output tensors of the operator.
For a first input tensor that includes a reduce-max axis, the main steps for splitting the first input tensor of the reduce-max operator are generally the same as those for splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of the reduce-sum split steps for Figure 8; details are not repeated here.
It should be noted that, as shown in Figure 9, splitting along the reduce-sum axis and splitting along the reduce-max axis differ in the type of function that must be called after operator B. For the reduce-sum operator, the second output tensors computed from the second input tensors on the different target computing resources are synchronized by calling the add function to obtain the first output tensor. For the reduce-max operator, the second output tensors computed from the second input tensors on the different computing resources are synchronized by calling the maximum function to obtain the first output tensor.
For a first input tensor that includes a reduce-min axis, the main steps for splitting the first input tensor of the reduce-min operator along the reduce-min axis are generally the same as those for splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of the reduce-sum split steps for Figure 8; details are not repeated here.
It should be noted that splitting along the reduce-sum axis and splitting along the reduce-min axis differ in the type of function that must be called after operator B. For the reduce-sum operator, the second output tensors computed from the second input tensors on the different target computing resources are synchronized by calling the add function to obtain the first output tensor. For the reduce-min operator, the second output tensors computed from the second input tensors on the different computing resources are synchronized by calling the minimum function to obtain the first output tensor.
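The reduce-max and reduce-min cases differ from reduce-sum only in the merge function. The following sketch, with a hypothetical helper `reduce_axis0` and arbitrary test data, reuses the same function for the per-resource reduction and for the synchronization step.

```python
def reduce_axis0(rows, combine):
    """Reduce over axis 0 with `combine` (for example max or min)."""
    return [combine(col) for col in zip(*rows)]

# First input tensor of shape (8, 3) with arbitrary integer values.
first_input = [[(7 * i + 3 * j) % 11 for j in range(3)] for i in range(8)]
parts = [first_input[:4], first_input[4:]]       # two second input tensors

results = {}
for name, combine in (("max", max), ("min", min)):
    reference = reduce_axis0(first_input, combine)        # without splitting
    partials = [reduce_axis0(p, combine) for p in parts]  # second output tensors
    merged = reduce_axis0(partials, combine)              # synchronization step
    results[name] = (reference, merged)
```

The maximum of partial maxima equals the global maximum (and likewise for the minimum), so the merged result matches the unsplit result for any disjoint cover of the axis.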
Figure 10 is a schematic diagram of a way of splitting a reduce-mean axis provided by an embodiment of the present application. The steps for splitting the input tensor of operator B along the reduce-mean axis are shown in Figure 10, which takes operator B to be a reduce-mean operator as an example. The embodiments of the present application do not limit the type of operator B. It should be noted that the reduce-mean operator in Figure 10 is illustrated with a single input tensor and a single output tensor; the embodiments of the present application do not limit the number of input tensors or output tensors of the operator.
For a first input tensor that includes a reduce-mean axis, the main steps for splitting the first input tensor of the reduce-mean operator are generally the same as those for splitting the first input tensor of the reduce-sum operator along the reduce-sum axis in Figure 8. Refer to the description of the reduce-sum split steps for Figure 8; details are not repeated here.
It should be noted that, as shown in Figure 10, splitting along the reduce-mean axis and splitting along the reduce-sum axis differ in the number of functions that must be called after operator B. The second output tensors computed by the reduce-mean operator from the second input tensors on the different target computing resources are first synchronized by calling the add function, which yields an intermediate output tensor; the multiply function must then be called to obtain the first output tensor. Here the add function is a synchronization node that sums the second output tensors of the different computing resources on the reduce-mean axis, and the multiply function multiplies the summed intermediate output tensor by 1/group to obtain the first output tensor, where group is the number of target computing resources. For example, in Figure 10 there are two target computing resources, so group is 2.
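The add-then-multiply merge for the reduce-mean axis can be sketched as follows. Exact fractions are used to avoid rounding noise, and the helper name `reduce_mean_axis0` is hypothetical. One point worth checking in any implementation: scaling by 1/group reproduces the global mean only when the parts have equal length along the reduce-mean axis, which the sketch demonstrates with an unequal split.

```python
from fractions import Fraction

def reduce_mean_axis0(rows):
    """Mean over axis 0, kept exact with Fraction."""
    return [Fraction(sum(col), len(col)) for col in zip(*rows)]

def merge_means(partials, group):
    """Add the partial means, then multiply by 1/group."""
    intermediate = [sum(vals) for vals in zip(*partials)]    # add step
    return [v * Fraction(1, group) for v in intermediate]    # multiply step

first_input = [[i * i + j for j in range(3)] for i in range(8)]  # shape (8, 3)
reference = reduce_mean_axis0(first_input)
group = 2

# Equal split (4 + 4): the merged value matches the unsplit mean.
equal_parts = [first_input[:4], first_input[4:]]
merged = merge_means([reduce_mean_axis0(p) for p in equal_parts], group)

# Unequal split (3 + 5): plain 1/group scaling no longer matches.
uneven_parts = [first_input[:3], first_input[3:]]
uneven = merge_means([reduce_mean_axis0(p) for p in uneven_parts], group)
```

For unequal parts, a weighted combination (each partial mean weighted by its part length) would be needed; that variant is an observation of this sketch, not something stated in the text above.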
The second type of reduce axis includes the reduce-gather axis, which is the axis on which operator B indexes data among the elements of its input tensor according to the addresses indicated by the elements of its index input tensor. That is, when the first input tensor contains a reduce-gather axis, the corresponding data is looked up on the reduce-gather axis of the first input tensor according to the addresses in the first index input tensor (indice) and used as the data on the 0 axis of the first output tensor. Figure 11 shows a way of splitting a reduce-gather axis provided by an embodiment of the present application, described in detail with operator B as a gather (gather2) operator. The embodiments of the present application do not limit the type of operator B. It should be noted that the gather operator in Figure 11 is illustrated with an index input tensor and one first input tensor as inputs and one first output tensor as output; the embodiments of the present application do not limit the number of first input tensors or first output tensors of the operator.
Specifically, as shown in (a) of Figure 11, the gather operator has two input tensors: the first input tensor and the first index input tensor. The first input tensor is the data input tensor, with shape (80,64); the first index input tensor holds the index addresses, with shape (20,). From the split information of the gather operator, the target split axis is determined to be a reduce-gather axis located on the 0 axis of the first input tensor. By the property of the reduce-gather axis, the data on the 0 axis of the first output tensor is found by looking up the corresponding data elements on the 0 axis of the first input tensor according to the index addresses in the first index input tensor. The shape of the first output tensor is therefore (20,64).
According to the number of target computing resources and the length of the reduce-gather axis, the third split function is called to split the first input tensor along the reduce-gather axis into two second input tensors. Each target computing resource has its own second input tensor, together with an index input tensor obtained by calling an offset function on the first index input tensor. Each target computing resource then runs the gather operator to obtain its own second output tensor, and the add function is called to add the second output tensors on the different target computing resources, synchronizing the data and yielding the first output tensor. As shown in (b) of Figure 11, two target computing resources are used for the gather operator. The first output tensor has no reduce-gather axis, and the length of its 0 axis equals the length of the 0 axis of the first index input tensor. Because there are two computing resources, the third split function is called to split the reduce-gather axis of the first input tensor, giving second input tensors whose reduce-gather axis has length 40: the first second input tensor has shape (40,64) and the second second input tensor also has shape (40,64). Each computing resource receives the same first index input tensor when it runs the gather operator. Because the gather operator on each computing resource holds only half of the data on the reduce-gather axis of the first input tensor, namely the first or the second second input tensor, the first index input tensor must pass through the offset operator to guarantee that the second output tensor produced by the gather operator on each computing resource is correct.
It should be noted that, because the first input tensor is divided into two parts, when the gather operator on a computing resource searches the reduce-gather axis of its second input tensor for the addresses in the first index input tensor in order to obtain the data on the 0 axis of the second output tensor, the requested data may not be present. In that case 0 is used as the lookup result. Finally, the second output tensors produced by the gather operator on the two computing resources are added by the add operator to obtain the first output tensor.
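The offset, zero-fill, and add steps for the reduce-gather axis can be sketched as below. The data tensor uses the row count from the example (80) but a narrower row width (4 instead of 64) to keep the sketch small; the index values and the helper names `gather_axis0_or_zero` and `add_n` are assumptions made for illustration.

```python
def gather_axis0_or_zero(rows, indices):
    """Gather rows by index; an index outside this shard yields a zero row."""
    width = len(rows[0])
    return [rows[i] if 0 <= i < len(rows) else [0] * width for i in indices]

def add_n(tensors):
    """Element-wise sum of equally shaped 2-D tensors (stand-in for the add step)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*tensors)]

first_input = [[r * 100 + c for c in range(4)] for r in range(80)]  # shape (80, 4)
indices = [(13 * k + 5) % 80 for k in range(20)]                    # shape (20,)
reference = [first_input[i] for i in indices]                       # shape (20, 4)

shard = 40  # length of the reduce-gather axis on each computing resource
partials = []
for n in range(2):
    second_input = first_input[n * shard:(n + 1) * shard]  # shape (40, 4)
    offset_indices = [i - n * shard for i in indices]      # offset step
    partials.append(gather_axis0_or_zero(second_input, offset_indices))

merged = add_n(partials)  # first output tensor
```

Every index hits exactly one shard and produces zeros on the other, so the element-wise sum reconstructs the unsplit gather result. The explicit range check matters in Python, where a negative index would otherwise wrap around.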
In the embodiments of the present application, the type of a reduce axis already determines the specific way in which it is split. During graph optimization, the graph optimizer can therefore reasonably split the input tensor of a specific operator that contains a reduce axis without relying on the principle of that operator. By contrast, existing operator splitting methods all start from the output tensor of a specific operator. Because a reduce axis either does not appear in the output tensor or has length 1 there, such methods cannot split an axis of the input tensor that has reduce-axis characteristics.
Sliding window axis: if an iteration variable in the input tensor of operator C is a sliding window axis, then the sliding window axis is the axis along which operator C performs a sliding-window scan over the elements of its input tensor. If the sliding window is larger than the stride, the windows of every two adjacent scans overlap.
Suppose the first output tensor is split along the sliding window axis and, with two target computing resources, the elements on its sliding window axis are divided equally. Then some data on the sliding window axis of the two halves of the first output tensor depends on the same data on the sliding window axis of the first input tensor. There are therefore two ways of splitting a first input tensor that contains a sliding window axis, which are described in detail with reference to Figures 12 and 13.
It should be noted that the forward shape derivation function for the sliding window axis of the first input tensor and the first output tensor is y = f_2(x); that is, the length of the sliding window axis of the first output tensor is derived forward from the length of the sliding window axis of the first input tensor, where x is the length of the sliding window axis of the first input tensor and y is the length of the sliding window axis of the first output tensor. f_2() depends on the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
The inverse shape derivation function for the sliding window axis of the first input tensor and the first output tensor is x = f_2^{-1}(y); that is, reverse derivation is performed from the length of the sliding window axis in the first output tensor to determine an appropriate way of splitting, so as to obtain the second output tensor and the second input tensor of each computing resource. f_2^{-1}() likewise depends on the convolution padding value, the convolution kernel size, the convolution stride, and the convolution kernel dilation coefficient.
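One common form of the forward and inverse shape functions for a sliding window axis is sketched below. The exact formulas used by the application are not given in the text, so these are the standard convolution length formulas taken as an assumption; the function names and the `pad_total` parameter (total padding along the axis) are hypothetical.

```python
def forward_len(x, kernel, stride, pad_total=0, dilation=1):
    """Forward derivation: output length for input length x along the axis."""
    span = dilation * (kernel - 1) + 1  # extent covered by one window
    return (x + pad_total - span) // stride + 1

def inverse_len(y, kernel, stride, dilation=1):
    """Inverse derivation: padded input length needed to produce y outputs."""
    span = dilation * (kernel - 1) + 1
    return (y - 1) * stride + span

# Kernel size 3 and stride 2, with one padded element assumed:
# an input axis of length 56 yields an output axis of length 28.
full_output = forward_len(56, kernel=3, stride=2, pad_total=1)

# Half of that output (14 elements) needs 29 input elements.
half_input = inverse_len(14, kernel=3, stride=2)
```

The inverse returns the smallest input length that yields the requested output length; when the stride is greater than 1, slightly longer inputs can map to the same output length.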
Figure 12 is a schematic diagram of a way of splitting a sliding window axis provided by an embodiment of the present application. The overlapping way of splitting the input tensor of operator C along the sliding window axis is shown in Figure 12, which takes operator C to be a convolution operator as an example. The embodiments of the present application do not limit the type of operator C. It should be noted that the convolution operator in Figure 12 is illustrated with a single input tensor and a single output tensor; the embodiments of the present application do not limit the number of input tensors or output tensors of the operator.
Specifically, the type of the target split axis in the convolution operator is the sliding window axis. From the position information of the target split axis in the convolution operator, it can be determined that the target split axis is a sliding window axis located on the 1 axis of the operator's first input tensor; that is, the 1 axis, of length 56, is the sliding window axis. The length of the sliding window axis in the first output tensor is therefore derived forward from the forward shape derivation function of the sliding window axis and the length of the sliding window axis in the first input tensor. As shown in (a) of Figure 12, the first input tensor of the operator is (1,56,56,64). With a convolution stride of 2 and a convolution kernel size of 3, the forward shape derivation function for the sliding window axis gives a first output tensor of (1,28,56,64).
According to the number K of target computing resources and the length of the sliding window axis of the first output tensor, the first output tensor is split along the sliding window axis into K second output tensors. The inverse shape derivation function of the sliding window axis is then used to derive, in reverse, the length of the sliding window axis of the second input tensor of the convolution operator on each target computing resource. According to the length of the sliding window axis in each second output tensor, the first slice function can be called to slice the first input tensor along the sliding window axis and obtain the second input tensors. After the second output tensors are computed on the different target computing resources, the concat function can be called to obtain a first output tensor equivalent to the one computed without splitting.
As shown in (b) of Figure 12, two target computing resources are used for the convolution operator. The 1 axis of the first output tensor is the sliding window axis, with length 28, so before the concat function is called the 1 axis of the second output tensor on each computing resource has length 14. Then, according to the inverse shape derivation logic of the sliding window axis, with a convolution stride of 2 and a convolution kernel size of 3, the 1 axis of the second input tensor of each computing resource has length 29. Next, first slice function 1 and first slice function 2 are called to slice the first input tensor along the 1 axis, giving two second input tensors whose 1 axis has length 29. These two second input tensors contain overlapping data: the 1 axis of one second input tensor covers positions 0 to 28 of the 1 axis of the first input tensor, and the 1 axis of the other covers positions 28 to 56. The 29th data item on the 1 axis of the first input tensor is the part shared by the two second input tensors.
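The overlapping split can be checked numerically on a one-dimensional stand-in for the sliding window axis. The sketch assumes kernel size 3, stride 2, and a single zero of padding at the end of the axis (the padding is an assumption chosen so that the position ranges 0 to 28 and 28 to 56 both exist); `conv1d` is a hypothetical helper, not an API from the application.

```python
def conv1d(x, kernel, stride):
    """Unpadded 1-D sliding-window dot product."""
    k = len(kernel)
    return [sum(w * x[i + t] for t, w in enumerate(kernel))
            for i in range(0, len(x) - k + 1, stride)]

kernel, stride = [1, 2, 3], 2
first_input = list(range(1, 57))  # 56 elements on the sliding window axis
padded = first_input + [0]        # assumed padding: one zero at the end
reference = conv1d(padded, kernel, stride)  # 28 outputs

# Each half of the output (14 elements) needs (14 - 1) * 2 + 3 = 29 inputs,
# so the two slices share the element at position 28.
second_inputs = [padded[0:29], padded[28:57]]
second_outputs = [conv1d(s, kernel, stride) for s in second_inputs]
merged = second_outputs[0] + second_outputs[1]  # concat along the axis
```

Each computing resource works only on its own slice, and the concatenated result equals the unsplit convolution; the cost is that the shared element is held by both resources.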
The overlapping split shown in Figure 12 suits scenarios in which the split input tensors, after operator computation on different computing resources, do not need frequent data synchronization. An example is multi-threaded parallelism in which the threads are completely independent and can run as a parallel pipeline. In some scenarios, however, the split input tensors need frequent data synchronization after computation on different computing resources. The overlapping parts of the output tensors obtained on different computing resources are then spliced repeatedly, so the overlapping part keeps growing and causes unnecessary repeated computation. The embodiments of the present application therefore also provide another way of splitting the sliding window axis, without overlap, as shown in Figure 13.
Figure 13 is a schematic diagram of another way of splitting a sliding window axis provided by an embodiment of the present application. The steps for splitting the sliding window axis of the input tensor of operator C without overlap are shown in Figure 13, which takes operator C to be a convolution operator as an example. It should be noted that, as in Figure 12, the convolution operator in Figure 13 is illustrated with a single input tensor and a single output tensor; the embodiments of the present application do not limit the number of input tensors or output tensors of the operator.
Specifically, the steps in Figure 13 for deriving the length of the sliding window axis in the second input tensor of each target computing resource are the same as in the overlapping split of Figure 12; refer to the description of Figure 12 for details. Figure 13 differs from Figure 12 in the process by which the first input tensor is split into K second input tensors.
Specifically, in (b) of Figure 13 the second split function divides the first input tensor equally along the sliding window axis. Second slice function 1 and second slice function 2 extract the overlapping part of the sliding window axis of the second input tensors, that is, the data on which the sliding-window-axis data of the second output tensors computed on the different target computing resources jointly depends. First concat function 1 and first concat function 2 splice, along the sliding window axis, the third input tensors and fourth input tensors produced by the second split function, second slice function 1, and second slice function 2, giving the second input tensor of each target computing resource. The second concat function splices the second output tensors produced by the convolution operator on the different target computing resources to obtain the first output tensor.
Specifically, as shown in (b) of FIG. 13, there are two target computing resources, computing resource 1 and computing resource 2. The shape of the first output tensor is (1,28,56,64), and axis 1 of the first input tensor is the sliding window axis.
By calling the second split function, the first input tensor is split along axis 1 into two equal third input tensors, the first third input tensor and the second third input tensor, each of shape (1,28,56,64). By calling second slice function 1, the first third input tensor is sliced along axis 1 to obtain the second fourth input tensor, of shape (1,1,56,64); its data on axis 1 is the last element of the first third input tensor along the sliding window axis, that is, the 28th element of axis 1 of the first input tensor. By calling second slice function 2, the second third input tensor is sliced along axis 1 to obtain the first fourth input tensor, of shape (1,1,56,64); its data on axis 1 is the first element of the second third input tensor along the sliding window axis, that is, the 29th element of axis 1 of the first input tensor.
By calling first concatenation function 1, the first third input tensor and the first fourth input tensor are concatenated along axis 1 to obtain the first second input tensor, of shape (1,29,56,64); its data on axis 1 covers positions 0 to 28 of axis 1 of the first input tensor. Similarly, by calling first concatenation function 2, the second third input tensor and the second fourth input tensor are concatenated along axis 1 to obtain the second second input tensor, of shape (1,29,56,64); its data on axis 1 covers positions 28 to 56 of axis 1 of the first input tensor.
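The split, slice, and concatenate sequence described for FIG. 13 can be sketched on a one-dimensional stand-in for the sliding window axis. This is only an illustrative sketch: the function name `split_with_halo`, the axis length of 56 (implied by the two halves of length 28), and the order in which the sliced element is attached to each half are assumptions, not details stated in the embodiment.

```python
def split_with_halo(axis_data, halo=1):
    """Split a 1-D stand-in for the sliding window axis into two equal,
    non-overlapping parts, then attach `halo` neighbouring elements to each."""
    half = len(axis_data) // 2
    # Second split function: equal split without overlap.
    third_1, third_2 = axis_data[:half], axis_data[half:]
    # Second slice function 1: last element(s) of the first part.
    fourth_2 = third_1[-halo:]
    # Second slice function 2: first element(s) of the second part.
    fourth_1 = third_2[:halo]
    # First concatenation functions 1 and 2 (placement order is an assumption).
    second_1 = third_1 + fourth_1
    second_2 = fourth_2 + third_2
    return second_1, second_2


# Positions along axis 1 of the first input tensor (assumed length 56).
axis_1 = list(range(56))
second_1, second_2 = split_with_halo(axis_1)
print(len(second_1), len(second_2))  # 29 29
```

Each computing resource receives 29 positions along axis 1, which matches the (1,29,56,64) shape of the second input tensors.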
In the embodiments of this application, splitting a sliding window axis without overlap suits scenarios in which different computing resources need frequent data synchronization, for example multi-die parallelism, where the concatenation functions serve as data synchronization nodes between different dies. In this way, overlapping data is not computed repeatedly and does not keep growing, which effectively relieves the computational and storage pressure on the computing resources.
In the embodiments of this application, the graph optimizer splits a single operator according to the types of the axes in the operator's input tensors and the split mode corresponding to each axis type. The graph optimizer can thus obtain different single-operator split strategies automatically, without relying on the principle of any specific operator, which fully decouples the graph optimizer from the operator optimization module.
The above describes the different axis types and their corresponding split modes in detail. The axis types can be represented by the following data structure:
Figure PCTCN2022074576-appb-000004
It should be noted that the types of tensor axes are not limited to those listed in the embodiments of this application; other tensor axis types and corresponding operator split modes are possible, and the embodiments of this application do not limit this.
It should be noted that a computing resource may be a GPU, a CPU, a die, a chip, or the like. The embodiments of this application limit neither the type nor the number of computing resources; the two computing resources used in the embodiments are only an example.
In the embodiments of this application, the graph optimizer automatically splits the input and output tensors of an operator according to the axis types. The graph optimizer does not need to split them based on the principle of a specific operator; it only needs the split mode corresponding to each axis type. For the operator, splitting its input and output tensors does not change its computation formula and only changes some of its parameters. Graph optimization is thereby fully decoupled from the principles of specific operators, and a split of the operator's first input tensor that is based on axis types generalizes better.
The above describes how the input and output tensors of a single operator are split. The graph optimizer of the embodiments of this application can determine the split mode from the axis types of the input tensors of a single operator and the position information of the target split axis in that operator's input and output tensors. When a computing task requires multiple operators, the position information of a splittable axis in the input and output tensors of those operators makes it possible for the graph optimizer to cascade the split modes of different operators into a subgraph. The position information of an operator's splittable axes in its input and output tensors is described below with reference to FIG. 14.
FIG. 14 is a schematic diagram of the position information of an operator's splittable axes in the operator's input and output tensors according to an embodiment of this application.
The position information of an operator's splittable axis in the input and output tensors indicates which input tensors and which output tensors the same splittable axis appears in, and its specific position within each of them. Each splittable axis has one of the axis types described above.
It should be understood that the embodiments of this application do not limit the number of input or output tensors of an operator; multiple input tensors may yield multiple output tensors after the operator is applied. Nor do the embodiments limit the number of first tensor axes in the input and output tensors.
As shown in FIG. 14, taking a convolution operator as an example, the operator has two input tensors, a first feature map input tensor and a first weight input tensor, of shapes (8,56,56,64) and (4,3,3,64) respectively. Axes 0, 1, and 2 of the first feature map input tensor are sliding window axes and axis 3 is a reduction axis; axis 0 of the first weight input tensor is an element axis and axes 1, 2, and 3 are reduction axes. Because reduction axes do not appear in the output tensor, the tensor axes that appear in the first output tensor are axes 0, 1, and 2 of the first feature map input tensor and axis 0 of the first weight input tensor. The shape of the first output tensor is therefore (8,56,56,4).
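The way the output shape follows from the axis types in this convolution example can be sketched as follows. The enum `AxisType`, its member names, and the helper `output_shape` are hypothetical names introduced for illustration; the sketch also assumes, as in the example, that each sliding window axis keeps its length in the output.

```python
from enum import Enum


class AxisType(Enum):
    # Hypothetical names for the axis types discussed in the text.
    ELEMENT = "element"
    REDUCE = "reduce"
    SLIDING_WINDOW = "sliding_window"


SW, EL, RD = AxisType.SLIDING_WINDOW, AxisType.ELEMENT, AxisType.REDUCE

# (shape, per-axis types) for the feature map and weight inputs of FIG. 14.
feature_map = ((8, 56, 56, 64), [SW, SW, SW, RD])
weight = ((4, 3, 3, 64), [EL, RD, RD, RD])


def output_shape(inputs):
    """Collect, in order, every axis that is not a reduction axis."""
    shape = []
    for dims, types in inputs:
        for length, axis_type in zip(dims, types):
            if axis_type is not AxisType.REDUCE:
                shape.append(length)
    return tuple(shape)


out = output_shape([feature_map, weight])
print(out)  # (8, 56, 56, 4)
```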
The position information of splittable axes in an operator's input and output tensors has been described above with reference to the figure. Two concrete data structures for this position information are given below: one centered on the splittable axes, and one centered on the input and output tensors.
In one possible implementation, the data structure centered on the splittable axes includes the type of each splittable axis, the type of the input tensors in which the splittable axis appears, and the position of the splittable axis in each input tensor and output tensor:
Figure PCTCN2022074576-appb-000005
Specifically, taking an addition operator as an example, one input tensor has shape (3,1,5) and the other has shape (3,4,1); the addition operator produces an output tensor of shape (3,4,5). The data structure centered on the splittable axes can be expressed as:
Figure PCTCN2022074576-appb-000006
In one possible implementation, the data structure centered on the input and output tensors includes the number of each axis in each input tensor, the number of each axis in each output tensor, and the axis type corresponding to each number:
input_dim_name_defs: vector<vector<int>>  // the number of each axis in each input
output_dim_name_defs: vector<vector<int>>  // the number of each axis in each output
dim_slice_types: map<int, AXIS_TYPE>  // the axis type corresponding to each number
Specifically, taking an addition operator as an example, one input tensor has shape (3,1,5) and the other has shape (3,4,1); the addition operator produces an output tensor of shape (3,4,5). The data structure centered on the input and output tensors can be expressed as:
Figure PCTCN2022074576-appb-000007
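As a rough illustration of how the two representations relate, the sketch below writes the addition example in the tensor-centered form and derives an axis-centered view from it. The field names `input_dim_name_defs`, `output_dim_name_defs`, and `dim_slice_types` come from the text; the concrete axis numbers, the use of -1 for a length-1 broadcast dimension, the string axis types, and the helper `to_axis_centred` are assumptions made only for this sketch and are not taken from the figures.

```python
from collections import defaultdict

# Tensor-centered view of the addition example.
# -1 marks a length-1 broadcast dimension (hypothetical convention).
input_dim_name_defs = [[0, -1, 2], [0, 1, -1]]  # shapes (3,1,5) and (3,4,1)
output_dim_name_defs = [[0, 1, 2]]              # shape (3,4,5)
dim_slice_types = {0: "element", 1: "element", 2: "element"}


def to_axis_centred(inputs, outputs, types):
    """Group, per axis number, the tensors and positions where it appears."""
    axes = defaultdict(lambda: {"type": None, "inputs": {}, "outputs": {}})
    for key, tensors in (("inputs", inputs), ("outputs", outputs)):
        for tensor_index, names in enumerate(tensors):
            for position, name in enumerate(names):
                if name < 0:  # skip broadcast dimensions
                    continue
                axes[name]["type"] = types[name]
                axes[name][key][tensor_index] = position
    return dict(axes)


axis_view = to_axis_centred(
    input_dim_name_defs, output_dim_name_defs, dim_slice_types
)
# Axis 0 appears at position 0 of both inputs and of the output.
print(axis_view[0])
```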
Based on the axis types in the operators' input tensors and the position information of each axis in the input and output tensors, the graph optimizer cascades the splits of different operators into a subgraph. Different axis position information lends itself to different applications. Specific applications of the operator split method for processing computing tasks in the embodiments of this application are described in detail below with reference to FIG. 15 to FIG. 17.
FIG. 15 is a schematic diagram of a specific application of operator splitting according to an embodiment of this application. Scenario 1: an output tensor of the first operator that includes the splittable axis serves as the input tensor of the second operator, and the splitting of the input tensors of multiple consecutive operators is optimized.
In one possible implementation, the split mode of the first input tensor is determined according to the axis type of the target split axis in the different operators and its position information in the first input tensors and first output tensors of those operators.
Specifically, FIG. 15 contains two activation function operators, a ReLU operator and a TanH operator. The graph optimizer obtains the split information of the ReLU operator and of the TanH operator. The target split axis of the target split mode determined by the graph optimizer is an element axis. For the ReLU operator, this element axis is axis 0 of the first input tensor and axis 0 of the first output tensor; the first input tensor has shape (8,56,56,64), so by the split mode for element axes described above, the first output tensor also has shape (8,56,56,64). For the TanH operator, this element axis is likewise axis 0 of the first input tensor and axis 0 of the first output tensor; the first input tensor has shape (8,56,56,64), so the first output tensor also has shape (8,56,56,64).
If the positions of this element axis in the input and output tensors of the ReLU operator and the TanH operator were unknown, the second output tensors produced by the ReLU operator on each target computing resource would first have to be concatenated and synchronized to obtain the first input tensor of the TanH operator, and that tensor would then have to be split again to obtain the second input tensors for the different target computing resources. As shown in (a) of FIG. 15, completing the ReLU and TanH computations in this way requires two calls to the split function and two calls to the concatenation function.
Because the graph optimizer knows the position of the target split axis in the first input tensor and first output tensor of each operator, the intermediate concatenation operator and intermediate split operator that would otherwise arise from splitting the tensors of the ReLU and TanH operators can be omitted when the two operators run consecutively. As shown in (b) of FIG. 15, the element axis is axis 0 of both the input and output tensors of the ReLU operator and the TanH operator, so a single split operator node and a single concatenation operator node suffice for the consecutive ReLU and TanH computation.
Specifically, following the split mode for element axes described above, the split function is called once to split the first input tensor along the element axis into two equal second input tensors of the ReLU operator. On each computing resource, the ReLU operator produces its second output tensor, which serves as the second input tensor of the TanH operator. On each target computing resource, the TanH operator then produces its second output tensor. Finally, a single concatenation operator produces the final first output tensor.
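The single-split, single-concatenation flow for consecutive ReLU and TanH operators can be sketched as below. This is a minimal sketch using nested Python lists in place of tensors and a sequential loop in place of parallel computing resources; the helper names (`split`, `concat`, `run_fused`) are hypothetical, and the axis-0 length is assumed to be divisible by the number of resources.

```python
import math


def relu(row):
    return [max(0.0, v) for v in row]


def tanh(row):
    return [math.tanh(v) for v in row]


def split(rows, k):
    """Split along axis 0 into k equal parts (length assumed divisible by k)."""
    size = len(rows) // k
    return [rows[i * size:(i + 1) * size] for i in range(k)]


def concat(parts):
    """Concatenate parts back along axis 0."""
    return [row for part in parts for row in part]


def run_fused(rows, k=2):
    parts = split(rows, k)  # one split for both operators
    # Each "resource" runs ReLU then TanH on its own part, with no
    # intermediate concatenation or re-split between the two operators.
    outs = [[tanh(relu(row)) for row in part] for part in parts]
    return concat(outs)     # one concatenation at the end


rows = [[-1.0, 0.5], [2.0, -3.0], [0.0, 1.5], [-0.5, 4.0]]
unsplit = [tanh(relu(row)) for row in rows]
fused = run_fused(rows)
print(fused == unsplit)  # True
```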
It should be noted that the embodiments of this application do not limit the axis type of the target split axis in consecutive operators. The example here uses a target split axis that has the same type, element axis, in both the first operator and the second operator; the axis type of the target split axis in consecutive operators may be the same or different, and the embodiments of this application do not limit this.
In the embodiments of this application, consecutive operators are applied to each split input tensor on the same target computing resource, which enables parallel computation across multiple target computing resources.
FIG. 16 is a schematic diagram of another specific application of operator splitting according to an embodiment of this application. Scenario 2: a splittable axis appears in multiple input tensors and a single output tensor of a single operator.
In one possible implementation, the split mode of the input tensors of the first operator is determined according to the split information of the first operator and the number K of target computing resources.
Specifically, taking an addition operator as an example, the operator has two first input tensors: the first, x, has shape (m,n), and the second, y, has shape (m,).
According to the split information of the addition operator, splittable axis 1 is an element axis of length m that appears at axis 0 of the first input tensor x and axis 0 of the first input tensor y; splittable axis 2 is an element axis of length n that appears at axis 1 of the first input tensor x.
From the split information of the first operator, two operator split modes can be determined: the first splits the input tensors that include splittable axis 1 of length m, and the second splits the input tensors that include splittable axis 2 of length n.
As shown in (a) of FIG. 16, the input tensors that include splittable axis 1 of length m are split. Splittable axis 1 is taken as the target split axis, and from its position information in the input tensors of the addition operator it is determined that the first input tensor x and the first input tensor y are both to be split along axis 0. Following the split mode for an element axis, x is divided equally along axis 0 into two second input tensors x0 and x1, and y is divided equally along axis 0 into two second input tensors y0 and y1. These are sent to the two target computing resources, where the addition operator produces the second output tensors, and the concatenation function is then called to obtain the first output tensor.
As shown in (b) of FIG. 16, the input tensors that include splittable axis 2 of length n are split. Splittable axis 2 is taken as the target split axis, and from its position information in the input tensors of the addition operator it is determined that the first input tensor x is to be split along axis 1. Following the split mode for an element axis, x is divided equally along axis 1 into two second input tensors x0' and x1'. Because splittable axis 2 does not appear in the first input tensor y, y is sent to the different target computing resources as shared data. The addition operator then runs on each target computing resource to produce the second output tensors, and the concatenation function is called to obtain the first output tensor.
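The two split modes of FIG. 16 can be compared in a short sketch. Nested lists stand in for tensors, the helper names are hypothetical, and the addition is assumed to compute out[i][j] = x[i][j] + y[i], which follows from the statement that axis 0 of y is the same splittable axis as axis 0 of x. Lengths are assumed to be divisible by the number of resources.

```python
def add(x, y):
    """out[i][j] = x[i][j] + y[i]; axis 0 of y is shared with axis 0 of x."""
    return [[v + y[i] for v in row] for i, row in enumerate(x)]


def run_split_axis0(x, y, k=2):
    """Mode (a): split both x and y along axis 0, concatenate along axis 0."""
    step = len(x) // k
    parts = [
        add(x[i * step:(i + 1) * step], y[i * step:(i + 1) * step])
        for i in range(k)
    ]
    return [row for part in parts for row in part]


def run_split_axis1(x, y, k=2):
    """Mode (b): split only x along axis 1; y is shared by every resource."""
    step = len(x[0]) // k
    parts = [
        add([row[i * step:(i + 1) * step] for row in x], y)
        for i in range(k)
    ]
    # Concatenate the partial results along axis 1, row by row.
    return [[v for part in parts for v in part[r]] for r in range(len(x))]


x = [[10 * r + c for c in range(6)] for r in range(4)]  # shape (4, 6)
y = [100, 200, 300, 400]                                # shape (4,)
full = add(x, y)
by_axis0 = run_split_axis0(x, y)
by_axis1 = run_split_axis1(x, y)
print(by_axis0 == full, by_axis1 == full)  # True True
```

Both modes reproduce the unsplit result; they differ only in which tensors are split and which are shared.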
In one possible implementation, each target computing resource obtains the first input tensor y by addressing, or the first input tensor y is copied to each target computing resource; the embodiments of this application do not limit how the first input tensor y is shared.
In the embodiments of this application, a suitable operator split mode can be chosen flexibly according to the axis types of the splittable axes included in the operator's split information and the position information of those axes in the operator's input and output tensors.
FIG. 17 is a schematic diagram of yet another specific application of operator splitting according to an embodiment of this application. Scenario 3: splittable axis 1 occupies different positions in the first input tensor and the first output tensor of the first operator.
As shown in (a) of FIG. 17, taking a transpose operator as an example, the graph optimizer obtains the split information of the transpose operator and determines from it that splittable axis 1 is an element axis, that splittable axis 1 is at axis 0 of the first input tensor, and that splittable axis 1 is at axis 1 of the first output tensor. Based on the forward shape inference function of the element axis, the shape (56,8,56,64) of the first output tensor can be derived from the shape (8,56,56,64) of the first input tensor.
As for the split itself, as shown in (b) of FIG. 17, two target computing resources are available for the transpose operator. The first output tensor is split along axis 1 into two second output tensors whose axis-1 length is 4. The backward shape inference function of the element axis then determines that the axis-0 length of the two second input tensors is 4, and the split function is called to split the first input tensor, whose axis-0 length is 8, along axis 0 into two second input tensors.
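The shape bookkeeping in this transpose example can be sketched as follows. The function `split_shapes` is a hypothetical helper, not the embodiment's shape inference function; it only assumes that the same splittable axis sits at a known position in the input tensor and at a possibly different position in the output tensor.

```python
def split_shapes(input_shape, in_pos, output_shape, out_pos, k):
    """Derive per-resource input and output shapes for one splittable axis.

    The axis is at `in_pos` in the input and `out_pos` in the output; the
    output is divided first, and the input length is inferred from it.
    """
    if input_shape[in_pos] != output_shape[out_pos]:
        raise ValueError("positions do not refer to the same axis")
    part = output_shape[out_pos] // k  # equal split, divisibility assumed
    out_part = list(output_shape)
    out_part[out_pos] = part
    in_part = list(input_shape)
    in_part[in_pos] = part             # backward inference for an element axis
    return tuple(in_part), tuple(out_part)


in_part, out_part = split_shapes(
    (8, 56, 56, 64), 0, (56, 8, 56, 64), 1, k=2
)
print(in_part, out_part)  # (4, 56, 56, 64) (56, 4, 56, 64)
```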
In the embodiments of this application, the graph optimizer therefore only needs to know the axis types of the splittable axes of an operator's input tensors and the position information of those axes in the input and output tensors. Without relying on the type of the specific operator, it can split the operator's input and output tensors appropriately, which fully decouples operator optimization from graph optimization.
The above describes the method for processing a computing task according to the embodiments of this application. The apparatus for processing a computing task according to the embodiments of this application is described below with reference to FIG. 19. It should be understood that the apparatus described below can perform the foregoing method; to avoid unnecessary repetition, repeated descriptions are omitted where appropriate.
FIG. 19 is a schematic diagram of an apparatus for processing a computing task according to an embodiment of this application. The apparatus 1900 is applied to a graph optimizer and includes a processor 1901 and a transmission interface 1902. Optionally, the apparatus may further include a memory 1903 and a bus 1904.
The memory 1903, the processor 1901, and the transmission interface 1902 are communicatively connected to one another through the bus 1904.
The memory 1903 may be a ROM, a static storage device, or a RAM. The memory 1903 may store a program. When the program stored in the memory 1903 is executed by the processor 1901, the processor 1901 and the transmission interface 1902 are configured to perform the steps of the method for processing a computing task according to the embodiments of this application.
For example, the processor 1901 is configured to determine a first operator for performing a computing task, where the first operator includes N splittable axes and N is a positive integer greater than or equal to 1;
The processor 1901 is configured to obtain split information of the first operator from an operator split information library, where the split information of the first operator includes the axis type, in the first operator, of the n-th splittable axis among the N splittable axes and the position information of the n-th splittable axis in the first operator; the position information of the n-th splittable axis in the first operator indicates the position of the n-th splittable axis in the input tensors of the first operator, where n = 1, …, N.
The processor 1901 is configured to split the input tensors of the first operator according to the split information of the first operator and determine K groups of input tensors, where K is a positive integer greater than or equal to 2.
The transmission interface 1902 is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
It should be understood that the above is only an exemplary description. The apparatus for processing a computing task is used to perform the methods or steps mentioned in the foregoing method embodiments and therefore corresponds to those embodiments. For details, refer to the description of the foregoing method embodiments, which is not repeated here.
The processor 1901 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, configured to execute related programs so as to implement the functions to be performed by the units in the apparatus for processing a computing task according to the embodiments of this application, or to perform the method for processing a computing task according to the method embodiments of this application.
The processor 1901 may also be an integrated circuit chip with signal processing capability. During implementation, the steps of the method for processing a computing task according to the embodiments of this application may be completed by integrated logic circuits of hardware in the processor 1901 or by instructions in the form of software.
The processor 1901 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1903; the processor 1901 reads the information in the memory 1903 and, in combination with its hardware, implements the functions to be performed by the units included in the apparatus for processing a computing task, or performs the method for processing a computing task according to the method embodiments of this application.
The transmission interface 1902 uses a transceiver apparatus, for example but not limited to a transceiver, to implement communication between the apparatus 1900 and other devices or a communication network. For example, an image to be processed may be obtained through the transmission interface 1902.
The bus 1904 may include a path for transferring information between the components of the apparatus 1900 (for example, the memory 1903, the processor 1901, and the transmission interface 1902).
It should be noted that although the apparatus 1900 described above shows only a memory, a processor, and a transmission interface, those skilled in the art should understand that, in a specific implementation, the apparatus 1900 may also include other components necessary for normal operation. Those skilled in the art should also understand that, depending on specific needs, the apparatus 1900 may include hardware components that implement other additional functions. In addition, those skilled in the art should understand that the apparatus 1900 may include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIG. 19.
It should be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (static RAM, SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the foregoing embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that contains one or more sets of available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.
An embodiment of the present application provides a computer-readable storage medium configured to store a computer program. When the computer program runs on a computer, the computer is caused to perform the method for processing a computing task in the foregoing method embodiments.
An embodiment of the present application provides a computer program product. The computer program product includes computer program code, and when the computer program code is run, the method for processing a computing task in the foregoing method embodiments is implemented.
It should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A alone, both A and B, or B alone, where A and B may each be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it, but may also indicate an "and/or" relationship, which can be understood from the context.
In the present application, "at least one" means one or more, and "a plurality of" means two or more. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (31)

  1. A method for processing a computing task, wherein the method is performed by a graph optimizer and comprises:
    determining a first operator used to perform the computing task, wherein the first operator comprises N splittable axes, and N is a positive integer greater than or equal to 1;
    obtaining splitting information of the first operator from an operator splitting information library, wherein the splitting information of the first operator comprises an axis type, in the first operator, of an n-th splittable axis of the N splittable axes and first position information, the first position information indicates a position of the n-th splittable axis in an input tensor of the first operator, and n = 1, …, N;
    splitting the input tensor of the first operator according to the splitting information of the first operator to obtain K groups of input tensors, wherein K is a positive integer greater than or equal to 2; and
    sending the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
  2. The method according to claim 1, wherein the axis type of a splittable axis is one of the following types: an element axis, a reduction axis, and a sliding window axis;
    wherein an axis along which elements in an input tensor and an output tensor of an operator have a point-to-point mapping relationship is the element axis;
    if a first axis exists in the input tensor of the operator but does not exist in the output tensor of the operator, the first axis is the reduction axis; and
    an axis along which the operator performs a sliding window scanning operation on elements in the input tensor of the operator is the sliding window axis.
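By way of illustration only, the three axis types and the per-axis position information can be sketched as a small record. The names `AxisType`, `AxisSplitInfo`, `MATMUL_SPLIT_INFO`, and `inputs_containing` are hypothetical and do not appear in the application; the matrix multiplication entry is an assumed example, not the content of the actual operator splitting information library.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"                # point-to-point mapping between input and output
    REDUCTION = "reduction"            # present in the input, absent from the output
    SLIDING_WINDOW = "sliding_window"  # scanned by a sliding window


@dataclass(frozen=True)
class AxisSplitInfo:
    name: str
    axis_type: AxisType
    input_positions: dict  # input tensor index -> dimension index of this axis


# Assumed entry for C[m, n] = sum_k A[m, k] * B[k, n], with inputs (A, B).
MATMUL_SPLIT_INFO = [
    AxisSplitInfo("m", AxisType.ELEMENT, {0: 0}),
    AxisSplitInfo("k", AxisType.REDUCTION, {0: 1, 1: 0}),
    AxisSplitInfo("n", AxisType.ELEMENT, {1: 1}),
]


def inputs_containing(axis_name, split_info):
    """Return the indices of the input tensors that contain the named axis."""
    for info in split_info:
        if info.name == axis_name:
            return sorted(info.input_positions)
    raise KeyError(axis_name)
```

In this sketch the axis `k` is a reduction axis because it appears in both inputs but not in the output, whereas `m` and `n` are element axes.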
  3. The method according to claim 2, wherein the splitting the input tensor of the first operator according to the splitting information of the first operator to obtain K groups of input tensors comprises:
    determining a target split axis, wherein the target split axis is one of the N splittable axes;
    determining, according to the splitting information of the first operator, a splitting manner corresponding to the axis type of the target split axis in the first operator; and
    splitting the input tensor of the first operator according to the splitting manner corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors.
  4. The method according to claim 3, wherein the splitting the input tensor of the first operator according to the splitting manner corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors comprises:
    determining, according to the splitting manner, Q first input tensors of the first operator that contain the target split axis and a position of the target split axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the quantity K of the target computing resources, to obtain Q groups of second input tensors, wherein each of the Q groups of second input tensors comprises K second input tensors; and
    obtaining the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
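By way of illustration only, the steps of claim 4 can be sketched for two-dimensional tensors stored as nested lists. The function names are hypothetical, the near-even split is an assumption about how lengths are distributed over the K resources, and passing each unsplit input whole to every group is an assumed reading of how the K groups are assembled.

```python
def split_lengths(length, k):
    """Divide `length` into k near-equal parts (assumed splitting policy)."""
    base, extra = divmod(length, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def split_2d(tensor, axis, k):
    """Split a 2-D nested list into k parts along axis 0 (rows) or 1 (columns)."""
    size = len(tensor) if axis == 0 else len(tensor[0])
    parts, start = [], 0
    for n in split_lengths(size, k):
        if axis == 0:
            parts.append(tensor[start:start + n])
        else:
            parts.append([row[start:start + n] for row in tensor])
        start += n
    return parts


def build_input_groups(inputs, axis_positions, k):
    """Build K groups of inputs.

    `axis_positions` maps the index of each of the Q inputs that contain the
    target split axis to the position of that axis in the input. Inputs not
    listed there are left unsplit and shared by all K groups.
    """
    split = {i: split_2d(inputs[i], pos, k) for i, pos in axis_positions.items()}
    return [
        [split[i][j] if i in split else inputs[i] for i in range(len(inputs))]
        for j in range(k)
    ]
```

For example, with inputs `A` of shape 4x2 and `B` of shape 2x3, `build_input_groups([A, B], {0: 0}, 2)` yields two groups, each holding two rows of `A` together with the whole of `B`.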
  5. The method according to claim 2, wherein when the operators used to perform the computing task further comprise a second operator, the second operator comprises P splittable axes, the P splittable axes are a subset of the N splittable axes, and
    the splitting the input tensor of the first operator according to the splitting information of the first operator to obtain K groups of input tensors comprises:
    obtaining splitting information of the second operator from the operator splitting information library, wherein the splitting information of the second operator comprises an axis type, in the second operator, of a p-th splittable axis of the P splittable axes and second position information, the second position information indicates a position of the p-th splittable axis in an input tensor of the second operator, the input tensor of the second operator is an output tensor of the first operator, P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P;
    determining P pieces of splitting reference information according to the splitting information of the first operator and the splitting information of the second operator, wherein the p-th piece of the P pieces of splitting reference information comprises: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator;
    determining P groups of candidate splitting manners according to the P pieces of splitting reference information, wherein the p-th group of the P groups of candidate splitting manners comprises at least one splitting manner;
    determining a target splitting manner according to the time required to complete the computing task with each splitting manner in the P groups of candidate splitting manners; and
    splitting the input tensor of the first operator according to the target splitting manner to obtain the K groups of input tensors.
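By way of illustration only, the selection of the target splitting manner can be sketched as a search over the P groups of candidates. The claim states only that the choice depends on the time each manner needs; this sketch assumes that the manner with the smallest estimated time is chosen. The function name and the `estimate_time` callback are hypothetical.

```python
def choose_target_manner(candidate_groups, estimate_time):
    """Pick one splitting manner from P groups of candidates.

    `candidate_groups` is a list of P lists of candidate manners, and
    `estimate_time` maps a manner to the estimated time needed to complete
    the computing task (an assumed cost model).
    """
    candidates = [manner for group in candidate_groups for manner in group]
    if not candidates:
        raise ValueError("no candidate splitting manner available")
    return min(candidates, key=estimate_time)
```

A caller might supply measured or modeled times through a dictionary lookup, for example `choose_target_manner(groups, times.get)`.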
  6. The method according to claim 5, wherein the splitting the input tensor of the first operator according to the target splitting manner to obtain the K groups of input tensors comprises:
    determining, according to the target splitting manner, a target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, Q first input tensors of the first operator that contain the target split axis, and a position of the target split axis in each of the Q first input tensors, wherein Q is a positive integer greater than or equal to 1;
    splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the quantity K of the target computing resources, to obtain Q groups of second input tensors,
    wherein each of the Q groups of second input tensors comprises K second input tensors, the q-th group of the Q groups of second input tensors is the result of splitting the q-th first input tensor of the Q first input tensors into K parts, and q = 1, …, Q; and
    determining the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
  7. The method according to claim 4, wherein when the axis type of the target split axis in the first operator is the element axis or the sliding window axis, the first position information of the target split axis further indicates a position of the target split axis in an output tensor of the first operator, and the splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the quantity K of the target computing resources, to obtain Q groups of second input tensors comprises:
    determining, according to the first position information of the target split axis, L first output tensors of the first operator that contain the target split axis and a position of the target split axis in each of the L first output tensors, wherein L is a positive integer greater than or equal to 1;
    using a first input length as an input of a forward shape derivation function of the target split axis to obtain a first output length, wherein the first input length is the length of the target split axis in each of the first input tensors, and the length of the target split axis is equal in each of the first input tensors;
    splitting the L first output tensors along the target split axis according to the first output length and the quantity K of the target computing resources to obtain L groups of second output tensors, wherein each of the L groups of second output tensors comprises K second output tensors;
    using the K second output lengths corresponding to the target split axis in each of the L groups of second output tensors as inputs of a reverse derivation function of the target split axis, to obtain K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors; and
    splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors.
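By way of illustration only, the forward and reverse derivation steps of claim 7 can be sketched for one axis. The formulas below assume a sliding window with a given kernel size and stride and with no padding or dilation; an element axis is treated as the special case of kernel size 1 and stride 1. The function names are hypothetical, and the near-even division of the output length is an assumption.

```python
def split_lengths(length, k):
    """Divide `length` into k near-equal parts (assumed splitting policy)."""
    base, extra = divmod(length, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def forward_len(in_len, kernel, stride):
    """Forward shape derivation: input length -> output length (no padding)."""
    return (in_len - kernel) // stride + 1


def reverse_len(out_len, kernel, stride):
    """Reverse derivation: output length -> input length needed to produce it."""
    return (out_len - 1) * stride + kernel


def second_input_lengths(in_len, k, kernel=1, stride=1):
    """Derive the K second input lengths from the first input length."""
    out_len = forward_len(in_len, kernel, stride)
    if out_len < k:
        raise ValueError("output axis is too short to split into k parts")
    return [reverse_len(n, kernel, stride) for n in split_lengths(out_len, k)]
```

With an input length of 10, a kernel of 3, and a stride of 1, the output length is 8, so two resources each produce 4 outputs and each needs 6 input elements. The two input lengths sum to 12 rather than 10, which is why a sliding window axis calls for an overlapping split, while an element axis splits cleanly into 5 and 5.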
  8. The method according to claim 7, wherein when the axis type of the target split axis in the first operator is the element axis, the splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors comprises:
    splitting, by invoking a first split function and according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, each of the Q first input tensors along the target split axis, to obtain the Q groups of second input tensors.
  9. The method according to claim 7, wherein when the axis type of the target split axis in the first operator is the sliding window axis, the splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors comprises:
    performing, by invoking a first slice function and according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, an overlapping split of each of the Q first input tensors along the target split axis, to obtain the Q groups of second input tensors.
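By way of illustration only, the overlapping split of claim 9 can be sketched on a one-dimensional input. A sliding sum stands in for an arbitrary sliding window operator, and no padding is assumed. The names `sliding_sum` and `overlapping_slices` are hypothetical and are not the first slice function of the application.

```python
def sliding_sum(x, kernel, stride=1):
    """Stand-in sliding window operator: sum over each window (no padding)."""
    return [sum(x[i:i + kernel]) for i in range(0, len(x) - kernel + 1, stride)]


def overlapping_slices(x, out_lengths, kernel, stride=1):
    """Slice `x` so that part j yields exactly out_lengths[j] outputs.

    Neighbouring slices overlap by (kernel - stride) elements when the
    kernel is larger than the stride.
    """
    slices, out_start = [], 0
    for n in out_lengths:
        begin = out_start * stride
        end = begin + (n - 1) * stride + kernel
        slices.append(x[begin:end])
        out_start += n
    return slices
```

Applying the operator to each slice independently and concatenating the partial outputs reproduces the output of the unsplit operator, which is the property the overlapping split is meant to preserve.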
  10. The method according to claim 7, wherein when the axis type of the target split axis in the first operator is the sliding window axis, the splitting each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, to obtain the Q groups of second input tensors comprises:
    splitting, by invoking a second split function, each of the Q first input tensors along the target split axis to obtain Q groups of third input tensors, each of the Q groups of third input tensors comprising K third input tensors;
    splitting, by invoking a second slice function and according to the K second input lengths corresponding to the target split axis in each of the Q groups of second input tensors, the K third input tensors in each of the Q groups of third input tensors along the target split axis, to obtain Q groups of fourth input tensors; and
    concatenating, by invoking a concatenation function and along the target split axis, the k-th fourth input tensor in the q-th group of the Q groups of fourth input tensors with the k-th third input tensor in the q-th group of the Q groups of third input tensors, to obtain the Q groups of second input tensors.
  11. The method according to claim 4, wherein when the axis type of the target split axis in the first operator is the reduction axis, the splitting each of the Q first input tensors according to the axis type of the target split axis in the first operator and the quantity K of the target computing resources, to obtain Q groups of second input tensors comprises:
    splitting, by calling a third split function and according to the quantity K of the target computing resources, each of the Q first input tensors, to obtain the Q groups of second input tensors.
  12. The method according to claim 11, wherein the reduction axis comprises a first type of reduction axis and a second type of reduction axis, the first type of reduction axis is a reduction axis along which the operator performs a reduction operation on elements in the input tensor of the operator, and the second type of reduction axis is a reduction axis along which the operator does not perform a reduction operation on elements in the input tensor of the operator.
  13. The method according to claim 12, wherein the first-type reduction axis comprises any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, or a reduce-mean axis;
    wherein the reduce-sum axis is a reduction axis along which the operator performs a summing reduction operation on elements in the input tensor of the operator;
    the reduce-max axis is a reduction axis along which the operator performs a maximum-value reduction operation on elements in the input tensor of the operator;
    the reduce-min axis is a reduction axis along which the operator performs a minimum-value reduction operation on elements in the input tensor of the operator;
    the reduce-mean axis is a reduction axis along which the operator performs an averaging reduction operation on elements in the input tensor of the operator.
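The four first-type reduction axes differ in how partial results from the K pieces are merged. The following is a minimal sketch on a one-dimensional tensor; the function names and the merge rules are illustrative assumptions, not taken from the claims.

```python
def split_evenly(values, k):
    """Cut a list into k contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(values), k)
    chunks, start = [], 0
    for i in range(k):
        stop = start + base + (1 if i < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def reduce_in_parts(values, k, kind):
    """Reduce `values` by splitting it into k chunks and merging the partial results."""
    chunks = split_evenly(values, k)  # assumes k <= len(values), so no chunk is empty
    if kind == "sum":
        return sum(sum(c) for c in chunks)
    if kind == "max":
        return max(max(c) for c in chunks)
    if kind == "min":
        return min(min(c) for c in chunks)
    if kind == "mean":
        # Averaging the chunk means would be wrong when chunk sizes differ,
        # so each chunk contributes its sum and its element count instead.
        total = sum(sum(c) for c in chunks)
        count = sum(len(c) for c in chunks)
        return total / count
    raise ValueError(f"unknown reduction kind: {kind}")


data = [4, 9, 1, 7, 3]
results = {kind: reduce_in_parts(data, 2, kind) for kind in ("sum", "max", "min", "mean")}
```

Sum, max, and min can be merged by applying the same operation to the partial results, whereas the mean needs the element counts as well.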
  14. The method according to claim 12, wherein the second-type reduction axis comprises a reduce-gather axis, the reduce-gather axis being an axis along which the operator indexes data among the elements of the input tensor of the operator according to the addresses indicated by the elements of an index input tensor of the operator.
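One way a gather axis could be handled when the data tensor is split is sketched below. It assumes that each shard keeps its global offset and resolves only the indices that fall within its own range; this handling and the name `gather_split` are assumptions for illustration and are not described in the claims.

```python
def gather_split(data, indices, k):
    """Gather data[i] for each i in `indices`, with `data` split into k shards."""
    base, extra = divmod(len(data), k)
    shards, start = [], 0
    for s in range(k):
        stop = start + base + (1 if s < extra else 0)
        shards.append((start, data[start:stop]))  # keep each shard's global offset
        start = stop

    output = [None] * len(indices)
    for offset, shard in shards:  # each iteration models one computing resource
        for pos, idx in enumerate(indices):
            local = idx - offset
            if 0 <= local < len(shard):
                # The element is copied, not combined with others: no reduction occurs.
                output[pos] = shard[local]
    return output


table = [10, 20, 30, 40, 50]
picked = gather_split(table, [4, 0, 2, 2], k=2)
```

The gathered axis is absent from the output, which is why it counts as a reduction axis, yet no element values are combined along it.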
  15. The method according to any one of claims 1 to 14, wherein the target computing resource comprises one of the following types:
    a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  16. An apparatus for processing a computing task, wherein the apparatus is applied to a graph optimizer and comprises a processor and a transmission interface, wherein:
    the processor is configured to determine a first operator for executing the computing task, the first operator comprising N splittable axes, N being a positive integer greater than or equal to 1;
    the processor is configured to obtain split information of the first operator from an operator split information library, the split information of the first operator comprising the axis type, in the first operator, of the n-th splittable axis among the N splittable axes and first position information, the first position information indicating the position of the n-th splittable axis in the input tensor of the first operator, where n = 1, …, N;
    the processor is configured to split the input tensor of the first operator according to the split information of the first operator to obtain K groups of input tensors, K being a positive integer greater than or equal to 2;
    the transmission interface is configured to send the K groups of input tensors to K target computing resources respectively, so that the K target computing resources complete the computing task.
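The split information described above can be pictured as a small lookup table keyed by operator. The sketch below is one possible layout; the class names, field names, and the matrix-multiplication entry are all hypothetical and are not defined by the claims.

```python
from dataclasses import dataclass
from enum import Enum


class AxisType(Enum):
    ELEMENT = "element"
    REDUCTION = "reduction"
    SLIDING_WINDOW = "sliding_window"


@dataclass(frozen=True)
class SplittableAxis:
    name: str
    axis_type: AxisType
    # Position information: input tensor index -> dimension index of this axis.
    input_positions: dict


# Hypothetical library entry for a matrix multiplication C[m, n] = A[m, r] @ B[r, n].
SPLIT_INFO_LIBRARY = {
    "matmul": [
        SplittableAxis("m", AxisType.ELEMENT, {0: 0}),
        SplittableAxis("n", AxisType.ELEMENT, {1: 1}),
        SplittableAxis("r", AxisType.REDUCTION, {0: 1, 1: 0}),
    ],
}


def get_split_info(operator_name):
    """Look up the N splittable axes recorded for an operator."""
    return SPLIT_INFO_LIBRARY[operator_name]


def tensors_containing(axis):
    """Indices of the input tensors that contain the axis, i.e. the ones to split."""
    return sorted(axis.input_positions)


info = get_split_info("matmul")
reduction_axis = next(a for a in info if a.axis_type is AxisType.REDUCTION)
```

Keeping the axis type and position per operator means the splitting logic can be written once per axis type rather than once per operator.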
  17. The apparatus according to claim 16, wherein the axis type of a splittable axis is one of the following: an element axis, a reduction axis, or a sliding window axis;
    wherein an axis along which elements in the input tensor and the output tensor of an operator have a point-to-point mapping relationship is the element axis;
    if a first axis is present in the input tensor of the operator but absent from the output tensor of the operator, the first axis is the reduction axis;
    an axis along which the operator performs a sliding-window scanning operation on elements in the input tensor of the operator is the sliding window axis.
  18. The apparatus according to claim 17, wherein the processor being configured to split the input tensor of the first operator according to the split information of the first operator to obtain the K groups of input tensors comprises:
    the processor being configured to:
    determine a target split axis, the target split axis being one of the N splittable axes;
    determine, according to the split information of the first operator, the split mode corresponding to the axis type of the target split axis in the first operator;
    split the input tensor of the first operator according to the split mode corresponding to the axis type of the target split axis in the first operator, to obtain the K groups of input tensors.
  19. The apparatus according to claim 18, wherein the processor is specifically configured to:
    determine, according to the split mode corresponding to the axis type of the target split axis in the first operator, the axis type of the target split axis in the first operator, the Q first input tensors of the first operator that contain the target split axis, and the position of the target split axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1;
    split each of the Q first input tensors according to the axis type of the target split axis in the first operator and the quantity K of the target computing resources, to obtain Q groups of second input tensors, each group among the Q groups of second input tensors comprising K second input tensors;
    obtain the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
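The last step above, combining the Q groups of split tensors with the unsplit inputs, can be sketched as follows. It assumes that inputs not containing the target split axis are passed whole to every computing resource; the function name and the data/bias example are hypothetical.

```python
def build_input_groups(inputs, split_pieces, k):
    """Assemble K groups of input tensors.

    inputs: list of the operator's original input tensors.
    split_pieces: dict mapping input index -> list of k pieces (the Q split groups).
    Inputs without an entry do not contain the target axis and are passed whole.
    """
    groups = []
    for part in range(k):
        group = [
            split_pieces[i][part] if i in split_pieces else tensor
            for i, tensor in enumerate(inputs)
        ]
        groups.append(group)
    return groups


# Hypothetical operator with a data tensor split on its first axis and a shared bias.
data = [[1, 2], [3, 4], [5, 6], [7, 8]]
bias = [10, 20]
pieces = {0: [data[:2], data[2:]]}   # Q = 1 group holding K = 2 second input tensors
groups = build_input_groups([data, bias], pieces, k=2)
```

Each of the K groups is then a complete set of inputs for one instance of the operator.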
  20. The apparatus according to claim 17, wherein, in a case where the operators for executing the computing task further comprise a second operator, the second operator comprises P splittable axes, the P splittable axes being a subset of the N splittable axes,
    the processor is specifically configured to:
    obtain split information of the second operator from the operator split information library, the split information of the second operator comprising the axis type, in the second operator, of the p-th splittable axis among the P splittable axes and second position information, the second position information indicating the position of the p-th splittable axis in the input tensor of the second operator, the input tensor of the second operator being the output tensor of the first operator, where P is a positive integer greater than or equal to 1 and less than or equal to N, and p = 1, …, P;
    determine P pieces of split reference information according to the split information of the first operator and the split information of the second operator, the p-th piece among the P pieces of split reference information comprising: the axis type of the p-th splittable axis in the first operator, the axis type of the p-th splittable axis in the second operator, and the position of the p-th splittable axis in the input tensor of the first operator;
    determine P groups of candidate split modes according to the P pieces of split reference information, the p-th group among the P groups of candidate split modes comprising at least one split mode;
    determine a target split mode according to the time required to complete the computing task under each split mode in the P groups of candidate split modes;
    split the input tensor of the first operator according to the target split mode to obtain the K groups of input tensors.
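Selecting the target split mode by completion time amounts to a minimum search over all candidates. The sketch below assumes a caller-supplied time estimator; the candidate tuples and the cost figures are invented purely for illustration.

```python
def choose_target_split(candidate_groups, estimate_time):
    """Return the candidate split mode with the smallest estimated completion time.

    candidate_groups: list of P groups, each a non-empty list of split modes.
    estimate_time: callable mapping a split mode to an estimated time.
    """
    all_candidates = [mode for group in candidate_groups for mode in group]
    return min(all_candidates, key=estimate_time)


# Hypothetical candidates, written as (axis name, split mode) pairs.
groups = [
    [("h", "overlapping_slice"), ("h", "split_slice_concat")],
    [("c", "reduction_split")],
]

# Hypothetical cost table in milliseconds; a real system would measure or model this.
cost_ms = {
    ("h", "overlapping_slice"): 3.2,
    ("h", "split_slice_concat"): 3.9,
    ("c", "reduction_split"): 2.7,
}

best = choose_target_split(groups, cost_ms.get)
```

How the time for each candidate is obtained (measurement, simulation, or an analytical model) is left open here.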
  21. The apparatus according to claim 20, wherein the processor is specifically configured to:
    determine, according to the target split mode, a target split axis, the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, the Q first input tensors of the first operator that contain the target split axis, and the position of the target split axis in each of the Q first input tensors, Q being a positive integer greater than or equal to 1;
    split each of the Q first input tensors according to the axis type of the target split axis in the first operator, the axis type of the target split axis in the second operator, and the quantity K of the target computing resources, to obtain Q groups of second input tensors,
    wherein each group among the Q groups of second input tensors comprises K second input tensors, and the q-th group among the Q groups of second input tensors is the result of splitting the q-th first input tensor among the Q first input tensors into K pieces, where q = 1, …, Q;
    obtain the K groups of input tensors according to the Q groups of second input tensors and the unsplit input tensors of the first operator.
  22. The apparatus according to claim 19, wherein, when the axis type of the target split axis in the first operator is the element axis or the sliding window axis, the first position information of the target split axis further indicates the position of the target split axis in the output tensor of the first operator, and the processor is specifically configured to:
    determine, according to the first position information of the target split axis, the L first output tensors of the first operator that contain the target split axis and the position of the target split axis in each of the L first output tensors, L being a positive integer greater than or equal to 1;
    obtain a first output length by using a first input length as the input of a forward shape inference function of the target split axis, the first input length being the length of the target split axis in each of the first input tensors, wherein the target split axis has the same length in every first input tensor;
    split the L first output tensors along the target split axis according to the first output length and the quantity K of the target computing resources to obtain L groups of second output tensors, each group among the L groups of second output tensors comprising K second output tensors, and the l-th group among the L groups of second output tensors being the result of splitting the l-th first output tensor among the L first output tensors into K pieces;
    obtain, by using the K second output lengths corresponding to the target split axis in each group among the L groups of second output tensors as the inputs of a backward inference function of the target split axis, the K second input lengths corresponding to the target split axis in each group among the Q groups of second input tensors, wherein the target split axis has the same length in the k-th second output tensor of every group among the L groups of second output tensors, and the same length in the k-th second input tensor of every group among the Q groups of second input tensors;
    split each of the Q first input tensors along the target split axis according to the K second input lengths corresponding to the target split axis in each group among the Q groups of second input tensors, to obtain the Q groups of second input tensors.
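The forward and backward length derivation can be made concrete for a sliding window axis. The sketch below assumes a window without padding or dilation, characterized only by a kernel size and a stride; these formulas and all function names are illustrative assumptions rather than definitions from the claims.

```python
def forward_length(in_len, kernel, stride):
    """Forward shape inference for a sliding window axis without padding."""
    return (in_len - kernel) // stride + 1


def backward_length(out_len, kernel, stride):
    """Backward inference: input length needed to produce `out_len` outputs."""
    return (out_len - 1) * stride + kernel


def split_lengths(total, k):
    """Divide `total` into k lengths that differ by at most one."""
    base, extra = divmod(total, k)
    return [base + (1 if i < extra else 0) for i in range(k)]


def plan_input_lengths(in_len, kernel, stride, k):
    out_len = forward_length(in_len, kernel, stride)   # first output length
    out_lens = split_lengths(out_len, k)               # K second output lengths
    in_lens = [backward_length(n, kernel, stride) for n in out_lens]  # K second input lengths
    return out_len, out_lens, in_lens


plan = plan_input_lengths(in_len=10, kernel=3, stride=1, k=2)
```

With an input length of 10 and a kernel of 3, the two input pieces are each 6 long and therefore overlap. Setting `kernel=1, stride=1` makes both functions the identity, which corresponds to the element axis case.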
  23. The apparatus according to claim 22, wherein, when the axis type of the target split axis in the first operator is the element axis, the processor is specifically configured to:
    split each of the Q first input tensors along the target split axis by invoking a first split function according to the K second input lengths corresponding to the target split axis in each group among the Q groups of second input tensors, to obtain the Q groups of second input tensors.
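For an element axis the pieces do not overlap, so the split is a plain partition by the K second input lengths. A minimal sketch, using an element-wise ReLU as a hypothetical operator and `split_by_lengths` as a stand-in for the first split function:

```python
def split_by_lengths(values, lengths):
    """Hypothetical 'first split function': cut a list into consecutive pieces."""
    pieces, start = [], 0
    for n in lengths:
        pieces.append(values[start:start + n])
        start += n
    return pieces


def relu(values):
    """Element-wise operator: every output element depends on one input element."""
    return [max(v, 0) for v in values]


x = [-2, 5, -1, 3, 0, -4, 7]
pieces = split_by_lengths(x, [3, 2, 2])          # K = 3 second input lengths
merged = [y for piece in pieces for y in relu(piece)]
```

Concatenating the per-piece outputs reproduces the output of the unsplit operator.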
  24. The apparatus according to claim 22, wherein, when the axis type of the target split axis in the first operator is the sliding window axis, the processor is specifically configured to:
    split each of the Q first input tensors along the target split axis with overlap by invoking a first slice function according to the K second input lengths corresponding to the target split axis in each group among the Q groups of second input tensors, to obtain the Q groups of second input tensors.
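The overlapping split can be sketched with a one-dimensional sliding-window operator. The code assumes no padding and uses `overlapping_split` as a hypothetical stand-in for the first slice function; the start offset of each piece is derived from the number of outputs assigned to the preceding pieces.

```python
def conv1d(x, w, stride=1):
    """Valid (no padding) 1-D sliding-window dot product."""
    n = (len(x) - len(w)) // stride + 1
    return [sum(x[i * stride + j] * w[j] for j in range(len(w))) for i in range(n)]


def overlapping_split(x, kernel, stride, out_lens):
    """Hypothetical 'first slice function': slice x so piece k yields out_lens[k] outputs."""
    pieces, out_start = [], 0
    for n in out_lens:
        start = out_start * stride
        length = (n - 1) * stride + kernel   # second input length from backward inference
        pieces.append(x[start:start + length])
        out_start += n
    return pieces


x = [1, 4, 2, 8, 5, 7, 3, 6]
w = [1, 0, -1]
pieces = overlapping_split(x, kernel=3, stride=1, out_lens=[3, 3])
merged = [y for p in pieces for y in conv1d(p, w)]
```

Here the two pieces share the elements `8, 5`; each computing resource can then run the operator on its piece independently, and the outputs concatenate to the full result.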
  25. The apparatus according to claim 22, wherein, when the axis type of the target split axis in the first operator is the sliding window axis, the processor is specifically configured to:
    split each of the Q first input tensors along the target split axis by invoking a second split function, to obtain Q groups of third input tensors, each group among the Q groups of third input tensors comprising K third input tensors;
    split, by invoking a second slice function according to the K second input lengths corresponding to the target split axis in each group among the Q groups of second input tensors, the K third input tensors of each group among the Q groups of third input tensors along the target split axis, to obtain Q groups of fourth input tensors;
    concatenate, by invoking a concatenation function, the k-th fourth input tensor of the q-th group among the Q groups of fourth input tensors and the k-th third input tensor of the q-th group among the Q groups of third input tensors along the target split axis, to obtain the Q groups of second input tensors.
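The split, slice, and concatenate sequence can be read as a halo exchange: non-overlapping pieces are produced first, the missing boundary elements are sliced from a neighboring piece, and the two are joined. The sketch below is one possible reading under stated assumptions: the split points are aligned with output boundaries, the halo is taken from the start of the next piece, and the halo fits within that piece. All function names are hypothetical.

```python
def split_at_output_boundaries(x, stride, out_lens):
    """Hypothetical 'second split function': non-overlapping pieces (third input tensors)."""
    starts, out_start = [], 0
    for n in out_lens:
        starts.append(out_start * stride)
        out_start += n
    stops = starts[1:] + [len(x)]
    return [x[a:b] for a, b in zip(starts, stops)]


def halo_slices(third, in_lens):
    """Hypothetical 'second slice function': elements each piece lacks (fourth input tensors)."""
    fourth = []
    for k, piece in enumerate(third):
        missing = max(in_lens[k] - len(piece), 0)
        # Assumption: the halo fits inside the next piece; the last piece needs none.
        neighbor = third[k + 1] if k + 1 < len(third) else []
        fourth.append(neighbor[:missing])
    return fourth


def concat_with_halo(third, fourth):
    """Hypothetical concatenation function: third piece followed by its halo."""
    return [t + f for t, f in zip(third, fourth)]


x = [1, 4, 2, 8, 5, 7, 3, 6]
kernel, stride, out_lens = 3, 1, [3, 3]
in_lens = [(n - 1) * stride + kernel for n in out_lens]   # K second input lengths

third = split_at_output_boundaries(x, stride, out_lens)
fourth = halo_slices(third, in_lens)
second = concat_with_halo(third, fourth)
```

The resulting second input tensors match what a direct overlapping slice of the original tensor would give, but only the halo elements are duplicated.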
  26. The apparatus according to claim 19, wherein, when the axis type of the target split axis in the first operator is the reduction axis, the processor is specifically configured to:
    split each of the Q first input tensors by calling a third split function according to the quantity K of the target computing resources, to obtain the Q groups of second input tensors.
  27. The apparatus according to claim 26, wherein the reduction axis comprises a first-type reduction axis and a second-type reduction axis, the first-type reduction axis being a reduction axis along which the operator performs a reduction operation on elements in the input tensor of the operator, and the second-type reduction axis being a reduction axis along which the operator does not perform a reduction operation on elements in the input tensor of the operator.
  28. The apparatus according to claim 27, wherein the first-type reduction axis comprises any one of the following: a reduce-sum axis, a reduce-max axis, a reduce-min axis, or a reduce-mean axis;
    wherein the reduce-sum axis is a reduction axis along which the operator performs a summing reduction operation on elements in the input tensor of the operator;
    the reduce-max axis is a reduction axis along which the operator performs a maximum-value reduction operation on elements in the input tensor of the operator;
    the reduce-min axis is a reduction axis along which the operator performs a minimum-value reduction operation on elements in the input tensor of the operator;
    the reduce-mean axis is a reduction axis along which the operator performs an averaging reduction operation on elements in the input tensor of the operator.
  29. The apparatus according to claim 27, wherein the second-type reduction axis comprises a reduce-gather axis, the reduce-gather axis being an axis along which the operator indexes data among the elements of the input tensor of the operator according to the addresses indicated by the elements of an index input tensor of the operator.
  30. The apparatus according to any one of claims 16 to 29, wherein the target computing resource comprises one of the following types:
    a graphics processing unit (GPU), a central processing unit (CPU), a die, or a chip.
  31. A computer-readable storage medium, wherein the computer-readable storage medium stores program code, the program code comprising instructions for performing the method according to any one of claims 1 to 15.
PCT/CN2022/074576 2022-01-28 2022-01-28 Method and device for processing computing task WO2023141939A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/074576 WO2023141939A1 (en) 2022-01-28 2022-01-28 Method and device for processing computing task
CN202280012811.6A CN116888601A (en) 2022-01-28 2022-01-28 Method and device for processing computing task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/074576 WO2023141939A1 (en) 2022-01-28 2022-01-28 Method and device for processing computing task

Publications (2)

Publication Number Publication Date
WO2023141939A1 true WO2023141939A1 (en) 2023-08-03
WO2023141939A9 WO2023141939A9 (en) 2024-07-25

Family

ID=87469960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074576 WO2023141939A1 (en) 2022-01-28 2022-01-28 Method and device for processing computing task

Country Status (2)

Country Link
CN (1) CN116888601A (en)
WO (1) WO2023141939A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118312327A (en) * 2024-06-06 2024-07-09 北京壁仞科技开发有限公司 Hardware resource allocation method, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465108A (en) * 2020-11-11 2021-03-09 上海交通大学 Neural network compiling method for storage and calculation integrated platform
US10970619B1 (en) * 2020-08-21 2021-04-06 Moffett Technologies Co., Limited Method and system for hierarchical weight-sparse convolution processing
CN113449857A (en) * 2020-03-27 2021-09-28 华为技术有限公司 Data processing method and data processing equipment
WO2021190761A1 (en) * 2020-03-27 2021-09-30 Huawei Technologies Co., Ltd. Parallel computing scheme generation for neural networks
CN113485837A (en) * 2021-07-21 2021-10-08 瀚博半导体(上海)有限公司 Tensor processing method and processing system based on parallel branch and tensor segmentation

Also Published As

Publication number Publication date
WO2023141939A9 (en) 2024-07-25
CN116888601A (en) 2023-10-13

Similar Documents

Publication Publication Date Title
US20220391678A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CN110321999B (en) Neural network computational graph optimization method
EP4036724A1 (en) Method for splitting neural network model by using multi-core processor, and related product
WO2021190127A1 (en) Data processing method and data processing device
EP4036810A1 (en) Neural network processing method and apparatus, computer device and storage medium
US20220121903A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
WO2021190597A1 (en) Processing method for neural network model, and related device
CN113994350A (en) Generating parallel computing schemes for neural networks
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
WO2023093724A1 (en) Neural network model processing method and device
US11630986B2 (en) Graph conversion method
WO2020062299A1 (en) Neural network processor, data processing method and related device
US20230126800A1 (en) Machine learning-based 2d structured image generation
EP4354349A1 (en) Halo transfer for convolution workload partition
WO2023141939A1 (en) Method and device for processing computing task
CN113672232A (en) Program compiling method and device
US11481604B2 (en) Apparatus and method for neural network processing
EP4439390A1 (en) Data processing method and apparatus
WO2023071658A1 (en) Ai model processing method and apparatus, and ai model computing method and apparatus
Eshratifar et al. Runtime deep model multiplexing for reduced latency and energy consumption inference
WO2021253440A1 (en) Depth-wise over-parameterization
CN115601513A (en) Model hyper-parameter selection method and related device
WO2024082679A1 (en) Method and apparatus for processing computational graph
US20240354162A1 (en) Graph orchestrator for neural network execution
Krasnoproshin et al. A New Approach to Building a Graphics Pipeline for Rendering

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase (Ref document number: 202280012811.6; Country of ref document: CN)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22922783; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)