WO2021190127A1 - Data processing method and data processing device - Google Patents

Data processing method and data processing device

Info

Publication number
WO2021190127A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
arrangement
operator
equivalent
tensor arrangement
Application number
PCT/CN2021/074108
Other languages
English (en)
French (fr)
Inventor
苏腾
陈婷婷
杨振章
张晓达
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Application filed by Huawei Technologies Co., Ltd.
Priority to EP21776526.2A (EP4123515A4)
Publication of WO2021190127A1
Priority to US17/952,848 (US20230023101A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a data processing method and a data processing device.
  • Data parallelism is the most widely used parallel strategy. However, single-card memory is limited, and as the number of training devices keeps increasing, communication overhead grows; data parallelism therefore encounters bottlenecks, and hybrid parallelism combining data parallelism and model parallelism is required.
  • The parallel scheme of a deep learning model can be expressed by the tensor arrangements of all operators in the model. A tensor arrangement includes a device matrix, a tensor shape, and a tensor map.
  • In the Mesh-TensorFlow scheme, any dimension of a tensor can be partitioned, but the device matrices of all tensors must be identical. Because of this requirement, tensor mapping is constrained: for example, the sample dimension of every tensor must be mapped to the same dimension of the shared device matrix. This limits conversion between multiple parallel modes; for example, hybrid parallelism combining data parallelism and model parallelism cannot be realized.
  • The embodiments of the present application provide a data processing method applied to a distributed cluster for training a deep neural network model, so that hybrid parallelism composed of different parallel modes can be implemented in the distributed cluster.
  • The first aspect of the embodiments of the present application provides a data processing method. The method is generally applied to a distributed cluster that includes multiple data processing devices, and can be performed by a data processing device in the cluster.
  • The method includes: obtaining a deep neural network model, the tensor arrangement of the input tensor of each operator in the deep neural network model, and the tensor arrangement of the output tensor of each operator, where a tensor arrangement includes a device matrix, a tensor map, and a tensor shape. Each element in the device matrix represents a data processing device in the distributed cluster, and the multiple data processing devices corresponding to the elements of the device matrix are used to execute the deep neural network model in parallel. The deep neural network model includes a first operator and a second operator, which are two consecutive operators in the deep neural network model; the output tensor of the first operator is the input tensor of the second operator, and the first tensor arrangement is inconsistent with the second tensor arrangement, where the first tensor arrangement is the tensor arrangement of the output tensor of the first operator and the second tensor arrangement is the tensor arrangement of the input tensor of the second operator. The method further includes: determining the slice computation graph of the data processing device according to the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator; and determining a rearrangement operator between the first operator and the second operator, where the rearrangement operator is used to convert the first tensor arrangement into the second tensor arrangement and to determine the updated slice computation graph.
  • The data processing device obtains the deep neural network model and the tensor arrangements of all operators in it, including the tensor arrangement of each input tensor and each output tensor. If, for two consecutive operators, the tensor arrangement of the output tensor of the first operator is inconsistent with the tensor arrangement of the input tensor of the second operator, the slice computation graph determined from these tensor arrangements cannot be executed.
  • The data processing device therefore determines a rearrangement operator between the first operator and the second operator to convert the tensor arrangement of the output tensor of the first operator into the tensor arrangement of the input tensor of the second operator. The rearrangement operator is inserted into the slice computation graph, and the resulting updated slice computation graph can be executed.
  • This solution applies to arbitrary tensor arrangements of the operators in the deep neural network model, and conversion between parallel modes is realized through rearrangement operators, so that various types of hybrid parallelism can be realized in distributed clusters.
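  • As a minimal illustration of the structure just described, the following Python sketch models the three components of a tensor arrangement and the mismatch check that triggers insertion of a rearrangement operator. The class and function names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorLayout:
    """The three components of a tensor arrangement (field names are assumptions)."""
    device_matrix: tuple   # arrangement of the devices, e.g. (2, 4)
    tensor_map: tuple      # device-matrix axis splitting each tensor dim, -1 = not split
    tensor_shape: tuple    # full (unsliced) shape of the tensor

def needs_rearrangement(first_output: TensorLayout, second_input: TensorLayout) -> bool:
    # A rearrangement operator is needed exactly when the producer's output
    # layout differs from the layout the consumer expects for its input.
    return first_output != second_input

out_layout = TensorLayout((2, 4), (1, -1), (16, 64))
in_layout = TensorLayout((2, 4), (-1, 0), (16, 64))
print(needs_rearrangement(out_layout, in_layout))  # True: insert a rearrangement operator
```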
  • In a possible implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement, and/or the tensor map of the first tensor arrangement is inconsistent with the tensor map of the second tensor arrangement. Determining the rearrangement operator between the first operator and the second operator includes: determining an intermediate tensor arrangement according to the first tensor arrangement and the second tensor arrangement; determining a first reshaping operator according to the first tensor arrangement and the intermediate tensor arrangement, where the tensor arrangement of the input tensor of the first reshaping operator is consistent with the first tensor arrangement and the tensor arrangement of the output tensor of the first reshaping operator is consistent with the intermediate tensor arrangement, the first reshaping operator being used to implement the conversion from the first tensor arrangement to the intermediate tensor arrangement; and/or determining a second reshaping operator according to the second tensor arrangement and the intermediate tensor arrangement, where the tensor arrangement of the input tensor of the second reshaping operator is consistent with the intermediate tensor arrangement and the tensor arrangement of its output tensor is consistent with the second tensor arrangement.
  • This describes how to determine the rearrangement operator when the tensor shapes of the first tensor arrangement and the second tensor arrangement are the same but at least one of the device matrix and the tensor map is inconsistent.
  • At least one reshaping operator can be generated through the intermediate tensor arrangement. After the reshaping operator is inserted between the first operator and the second operator, the first tensor arrangement can be converted into the second tensor arrangement, and the slice computation graph is updated with the generated reshaping operator, which reduces the user's workload and improves the efficiency of parallel training of the model.
  • In a possible implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement. Determining the intermediate tensor arrangement according to the first tensor arrangement and the second tensor arrangement includes: determining an expanded device matrix according to the two device matrices, where the product of the elements of the expanded device matrix equals the product of the elements of the device matrix of the output tensor of the first operator and also equals the product of the elements of the device matrix of the input tensor of the second operator, and any element in the device matrix of the first tensor arrangement or of the second tensor arrangement equals one element of the expanded device matrix or the product of at least two of its elements; and determining, according to the expanded device matrix, a first equivalent tensor arrangement equivalent to the first tensor arrangement and a second equivalent tensor arrangement equivalent to the second tensor arrangement.
  • This specifically introduces how to determine the reshaping operator when the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement: by determining the expanded device matrix, a first equivalent tensor arrangement equivalent to the first tensor arrangement and a second equivalent tensor arrangement equivalent to the second tensor arrangement are found, such that the device matrix of the first equivalent tensor arrangement is consistent with the device matrix of the second equivalent tensor arrangement; a reshaping operator is then generated from the two equivalent tensor arrangements.
  • Because the reshaping operator is generated automatically to update the slice computation graph, the user's workload in designing conversions between tensor arrangements is reduced, and the efficiency of parallel training of the model is improved.
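  • The following sketch illustrates one way an expanded device matrix with the stated product properties could be computed: take the union of the cumulative products (block boundaries) of both device matrices. The function name and the boundary-union construction are assumptions; inputs whose boundaries do not divide each other would need finer factoring, which this sketch leaves out.

```python
from math import prod

def expand_device_matrix(a, b):
    # Union of the cumulative products (block boundaries) of both matrices;
    # every element of `a` or `b` is then a product of consecutive elements
    # of the result, as the expanded device matrix requires.
    assert prod(a) == prod(b), "both matrices must cover the same device count"
    cuts = set()
    for matrix in (a, b):
        acc = 1
        for d in matrix:
            acc *= d
            cuts.add(acc)
    expanded, prev = [], 1
    for c in sorted(cuts):
        assert c % prev == 0, "non-nested boundaries would need finer factoring"
        expanded.append(c // prev)
        prev = c
    return expanded

print(expand_device_matrix([8, 4], [4, 8]))  # [4, 2, 4]: 8 = 4*2, 4 = 4, 8 = 2*4
```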
  • In a possible implementation, the tensor shape of the first equivalent tensor arrangement is consistent with the tensor shape of the second equivalent tensor arrangement, but the tensor map of the first equivalent tensor arrangement is inconsistent with the tensor map of the second equivalent tensor arrangement. The method further includes: determining one or more tensor mapping conversion operators, where a tensor mapping conversion operator is a split operator, a merge operator, or a communication operator; the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the first equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the second equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • This specifically introduces the situation where, after the device matrices are normalized, the tensor shapes are the same but the tensor maps are inconsistent: the data processing device determines a first tensor mapping conversion operator sequence, including one or more tensor mapping conversion operators, to convert the tensor map of the first equivalent tensor arrangement into the tensor map of the second equivalent tensor arrangement. Updating the slice computation graph with the generated operators reduces the user's workload in designing conversions between tensor arrangements and improves the efficiency of parallel training of the model.
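  • A sketch of how choosing among split, merge, and communication operators might look, assuming the two layouts already share one device matrix and tensor shape; the operator names, the dimension-by-dimension strategy, and the map convention in the comments are all assumptions, not the patent's algorithm.

```python
def mapping_conversion_ops(src_map, dst_map):
    """Pick tensor-mapping conversion operators for each tensor dimension.

    Assumed convention: map[i] is the device-matrix axis that splits tensor
    dimension i, or -1 if that dimension is not split.
    """
    ops = []
    for dim, (s, d) in enumerate(zip(src_map, dst_map)):
        if s == d:
            continue                          # dimension already consistent
        if s != -1 and d == -1:
            ops.append(("AllGather", dim, s))     # merge slices: communication
        elif s == -1 and d != -1:
            ops.append(("Split", dim, d))         # slice locally: no communication
        else:
            ops.append(("AllToAll", dim, s, d))   # exchange slices between axes
    return ops

print(mapping_conversion_ops([0, -1], [-1, 1]))
# [('AllGather', 0, 0), ('Split', 1, 1)]
```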
  • In a possible implementation, when the tensor shape of the first equivalent tensor arrangement is inconsistent with the tensor shape of the second equivalent tensor arrangement, the method further includes: normalizing the tensor shapes according to the first equivalent tensor arrangement and the second equivalent tensor arrangement, and determining a third equivalent tensor arrangement equivalent to the first equivalent tensor arrangement and a fourth equivalent tensor arrangement equivalent to the second equivalent tensor arrangement, where the device matrix of the third equivalent tensor arrangement is consistent with the device matrix of the fourth equivalent tensor arrangement, and the tensor shape of the third equivalent tensor arrangement is consistent with the tensor shape of the fourth equivalent tensor arrangement. The intermediate tensor arrangement includes the third equivalent tensor arrangement and the fourth equivalent tensor arrangement.
  • This specifically introduces the scenario where, after the device matrices of the first tensor arrangement and the second tensor arrangement are normalized, the tensor shapes are still inconsistent: tensor shape normalization is performed to determine the third and fourth equivalent tensor arrangements, which are used to determine a reshaping operator.
  • Because the reshaping operator is generated to update the slice computation graph, the user's workload in designing conversions between tensor arrangements is reduced, and the efficiency of parallel training of the model is improved.
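  • Tensor shape normalization can reuse the same boundary-union idea as the expanded device matrix above. A hedged sketch, assuming the two shapes describe the same elements and their dimension boundaries nest; the real normalization must also rewrite the tensor maps, which is omitted here.

```python
from math import prod

def normalize_tensor_shapes(shape_a, shape_b):
    # Refine both shapes to a common shape by taking the union of their
    # dimension boundaries (cumulative element counts).
    assert prod(shape_a) == prod(shape_b), "shapes must describe the same elements"
    cuts = set()
    for shape in (shape_a, shape_b):
        acc = 1
        for d in shape:
            acc *= d
            cuts.add(acc)
    refined, prev = [], 1
    for c in sorted(cuts):
        assert c % prev == 0, "non-aligned split; needs further factoring"
        refined.append(c // prev)
        prev = c
    return refined

print(normalize_tensor_shapes([8, 6], [2, 4, 6]))  # [2, 4, 6]
```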
  • In a possible implementation, the tensor map of the third equivalent tensor arrangement is inconsistent with the tensor map of the fourth equivalent tensor arrangement. The method further includes: determining one or more tensor mapping conversion operators (split, merge, or communication operators), where the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the third equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the fourth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • This specifically introduces the scenario where, after the device matrices of the first and second tensor arrangements are normalized and the tensor shapes are normalized, the tensor maps are still inconsistent; the conversion is then completed with the generated tensor mapping conversion operators, which reduces the user's workload and improves the efficiency of parallel training of the model.
  • In a possible implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, but the tensor shape of the first tensor arrangement is inconsistent with the tensor shape of the second tensor arrangement. Determining the intermediate tensor arrangement according to the first tensor arrangement and the second tensor arrangement includes: determining, according to the first tensor arrangement and the second tensor arrangement, a fifth equivalent tensor arrangement equivalent to the first tensor arrangement and a sixth equivalent tensor arrangement equivalent to the second tensor arrangement, where the device matrix of the fifth equivalent tensor arrangement is consistent with the device matrix of the sixth equivalent tensor arrangement, and the tensor shape of the fifth equivalent tensor arrangement is consistent with the tensor shape of the sixth equivalent tensor arrangement. The intermediate tensor arrangement includes the fifth equivalent tensor arrangement and the sixth equivalent tensor arrangement.
  • This specifically introduces the scenario where the device matrices of the first and second tensor arrangements are the same but the tensor shapes are inconsistent: tensor shape normalization determines the fifth and sixth equivalent tensor arrangements, which are used to generate the reshaping operator. By generating the reshaping operator to update the slice computation graph, the user's workload in designing conversions between tensor arrangements is reduced, and the parallel training efficiency of the model is improved.
  • In a possible implementation, the tensor map of the fifth equivalent tensor arrangement is inconsistent with the tensor map of the sixth equivalent tensor arrangement. The method further includes: determining one or more tensor mapping conversion operators (split, merge, or communication operators), where the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the fifth equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the sixth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • This specifically introduces the scenario where the device matrices of the first and second tensor arrangements are consistent, the tensor shapes are inconsistent, and the tensor maps are still inconsistent after the tensor shapes are normalized. The data processing device also determines a further tensor mapping conversion operator sequence, including one or more tensor mapping conversion operators, to realize the conversion from the fifth equivalent tensor arrangement to the sixth equivalent tensor arrangement. Updating the slice computation graph with the generated operators reduces the user's workload in designing conversions between tensor arrangements and improves the efficiency of parallel training of the model.
  • In a possible implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, the tensor shape of the first tensor arrangement is consistent with the tensor shape of the second tensor arrangement, and the tensor map of the first tensor arrangement is inconsistent with the tensor map of the second tensor arrangement. Determining the rearrangement operator between the first operator and the second operator includes: determining one or more tensor mapping conversion operators (split, merge, or communication operators), where the one or more tensor mapping conversion operators take the output tensor of the first operator as input and output the input tensor of the second operator.
  • This specifically introduces the scenario where the device matrices are consistent, the tensor shapes are the same, and the tensor maps are inconsistent: the data processing device determines a tensor mapping conversion operator sequence, including one or more tensor mapping conversion operators, that realizes the conversion from the first tensor arrangement to the second tensor arrangement. Because one or more tensor mapping conversion operators can be generated directly from the first and second tensor arrangements to update the slice computation graph, the user's workload in designing conversions between tensor arrangements is reduced and the efficiency of parallel training of the model is improved.
  • In a possible implementation, obtaining the deep neural network model, the tensor arrangement of the input tensor of each operator, and the tensor arrangement of the output tensor of each operator includes: obtaining the deep neural network model and a segmentation strategy, where the segmentation strategy includes the number of cuts of each tensor of the deep neural network model in each dimension; and determining, according to the deep neural network model and the segmentation strategy, the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator in the deep neural network model.
  • In this implementation, the data processing device obtains the tensor arrangements by first obtaining the segmentation strategy and then generating the tensor arrangement of each operator according to the deep neural network model and the segmentation strategy. This provides another way to determine the operator tensor arrangements and increases the flexibility of the scheme.
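  • As one hypothetical construction (not the patent's algorithm), a segmentation strategy can be turned into a tensor arrangement by taking the per-dimension cut counts as the device matrix and mapping each cut dimension to its own device axis:

```python
def layout_from_strategy(tensor_shape, strategy):
    # strategy[i] = number of cuts of tensor dimension i (1 = not cut).
    assert len(tensor_shape) == len(strategy)
    assert all(s % c == 0 for s, c in zip(tensor_shape, strategy)), \
        "each dimension must divide evenly by its cut count"
    device_matrix = list(strategy)
    tensor_map = [i if c > 1 else -1 for i, c in enumerate(strategy)]
    return {"device_matrix": device_matrix,
            "tensor_map": tensor_map,
            "tensor_shape": list(tensor_shape)}

# Cut a [128, 256] tensor 4 ways on dim 0 and 2 ways on dim 1 (8 devices):
print(layout_from_strategy([128, 256], [4, 2]))
# {'device_matrix': [4, 2], 'tensor_map': [0, 1], 'tensor_shape': [128, 256]}
```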
  • In a possible implementation, the segmentation strategy includes a first segmentation strategy and a second segmentation strategy. Determining the tensor arrangement of the input tensor and the tensor arrangement of the output tensor of each operator according to the deep neural network model and the segmentation strategy includes: determining a first overall tensor arrangement corresponding to the first segmentation strategy and a second overall tensor arrangement corresponding to the second segmentation strategy, where the first overall tensor arrangement consists of the tensor arrangements of the input tensor and the output tensor of each operator in the deep neural network model determined based on the first segmentation strategy, and the second overall tensor arrangement consists of the tensor arrangements of the input tensor and the output tensor of each operator determined based on the second segmentation strategy. The method further includes: selecting the first overall tensor arrangement from the first overall tensor arrangement and the second overall tensor arrangement.
  • In this implementation, the data processing device can obtain at least two segmentation strategies and determine the overall tensor arrangement corresponding to each. It should be noted that one segmentation strategy can correspond to multiple overall tensor arrangements. Different overall tensor arrangements are compared to determine the first overall tensor arrangement, which has the smaller overhead.
  • Here, the overhead refers to the sum of the communication time and the computation time of performing deep neural network model training based on an overall tensor arrangement. It should be noted that performing training based on an overall tensor arrangement requires determining, on the basis of that arrangement, the rearrangement operators to be inserted.
  • The slice computation graph is determined based on the first overall tensor arrangement. This scheme can consider the overall tensor arrangements corresponding to multiple segmentation strategies and select the one with lower cost for slicing, which reduces the cost of deep neural network model training.
  • In a possible implementation, the cost model of the first overall tensor arrangement is smaller than the cost model of the second overall tensor arrangement. The cost model of the first overall tensor arrangement is the value obtained by a weighted summation of the size of the data tensors, the size of the communication tensors, and the size of the parameter tensors in the first overall tensor arrangement, using a weight coefficient for the data tensor size, a weight coefficient for the communication tensor size, and a weight coefficient for the parameter tensor size. The cost model of the second overall tensor arrangement is the value obtained by the same weighted summation over the sizes of the data tensors, communication tensors, and parameter tensors in the second overall tensor arrangement.
  • When the first overall tensor arrangement is selected, the cost model can be used to compare the overheads of different overall tensor arrangements, approximating storage and computation costs by tensor sizes; this provides a concrete way to compare the overheads of different overall tensor arrangements. The weight coefficients of the data tensor size, the communication tensor size, and the parameter tensor size can be set flexibly, which improves the flexibility of the scheme.
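  • In code, the cost model just described is a plain weighted sum. A minimal sketch with placeholder weight values, since the coefficients are left free:

```python
def arrangement_cost(data_size, comm_size, param_size,
                     w_data=1.0, w_comm=1.0, w_param=1.0):
    # Weighted summation of tensor sizes, as described above.
    # The default weights are placeholders; the coefficients can be set flexibly.
    return w_data * data_size + w_comm * comm_size + w_param * param_size

# Choosing between two candidate overall tensor arrangements (sizes are made up):
candidates = {"arrangement_1": (1024, 256, 512), "arrangement_2": (1024, 2048, 512)}
best = min(candidates, key=lambda k: arrangement_cost(*candidates[k]))
print(best)  # arrangement_1: smaller communication tensors, smaller modeled cost
```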
  • In a possible implementation, the segmentation strategy is a segmentation strategy specified by a user.
  • the data processing method provided by the embodiment of the present application may also be applicable to a scenario where a user specifies a segmentation strategy.
  • In a possible implementation, the input tensor of each operator in the deep neural network model includes a training data set, and the training data set includes a text data set, an image data set, or an audio data set. The method can be used in the distributed training of text translation models, speech recognition models, face recognition models, three-dimensional reconstruction models, and virtual reality models.
  • For example, when the training data set is a text data set, the corresponding deep neural network model can be used to realize automatic translation; when the training data set is an image data set, the corresponding deep neural network model can be used to realize image recognition, face recognition, or three-dimensional modeling.
  • A second aspect of the embodiments of the present application provides a data processing device, where the device includes:
  • an acquiring unit, configured to acquire a deep neural network model, the tensor arrangement of the input tensor of each operator in the deep neural network model, and the tensor arrangement of the output tensor of each operator, where a tensor arrangement includes a device matrix, a tensor map, and a tensor shape; the deep neural network model includes a first operator and a second operator, which are two consecutive operators in the deep neural network model, the output tensor of the first operator is the input tensor of the second operator, and the first tensor arrangement is inconsistent with the second tensor arrangement, where the first tensor arrangement is the tensor arrangement of the output tensor of the first operator and the second tensor arrangement is the tensor arrangement of the input tensor of the second operator; and
  • a determining unit, configured to determine a slice computation graph according to the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator; the determining unit is further configured to determine a rearrangement operator between the first operator and the second operator, where the rearrangement operator is used to convert the first tensor arrangement into the second tensor arrangement.
  • In a possible implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement, and/or the tensor map of the first tensor arrangement is inconsistent with the tensor map of the second tensor arrangement; the determining unit is specifically configured to: determine an intermediate tensor arrangement according to the first tensor arrangement and the second tensor arrangement; determine a first reshaping operator according to the first tensor arrangement and the intermediate tensor arrangement, where the tensor arrangement of the input tensor of the first reshaping operator is consistent with the first tensor arrangement and the tensor arrangement of the output tensor of the first reshaping operator is consistent with the intermediate tensor arrangement, the first reshaping operator being used to implement the conversion from the first tensor arrangement to the intermediate tensor arrangement; and/or determine a second reshaping operator according to the second tensor arrangement and the intermediate tensor arrangement.
  • In a possible implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement; the determining unit is specifically configured to: determine an expanded device matrix according to the device matrix of the first tensor arrangement and the device matrix of the second tensor arrangement; and determine, according to the expanded device matrix, a first equivalent tensor arrangement equivalent to the first tensor arrangement and a second equivalent tensor arrangement equivalent to the second tensor arrangement, where the device matrix of the first equivalent tensor arrangement is consistent with the device matrix of the second equivalent tensor arrangement. The intermediate tensor arrangement includes the first equivalent tensor arrangement and the second equivalent tensor arrangement, and the tensor shape is the number of elements of the tensor in each dimension.
  • In a possible implementation, the tensor shape of the first equivalent tensor arrangement is consistent with the tensor shape of the second equivalent tensor arrangement, and the tensor map of the first equivalent tensor arrangement is inconsistent with the tensor map of the second equivalent tensor arrangement; the determining unit is further configured to determine one or more tensor mapping conversion operators, where a tensor mapping conversion operator is a split operator, a merge operator, or a communication operator, the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the first equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the second equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • In a possible implementation, the determining unit is further configured to perform tensor shape normalization according to the first equivalent tensor arrangement and the second equivalent tensor arrangement, and to determine a third equivalent tensor arrangement equivalent to the first equivalent tensor arrangement and a fourth equivalent tensor arrangement equivalent to the second equivalent tensor arrangement, where the device matrix of the third equivalent tensor arrangement is consistent with the device matrix of the fourth equivalent tensor arrangement, and the tensor shape of the third equivalent tensor arrangement is consistent with the tensor shape of the fourth equivalent tensor arrangement. The intermediate tensor arrangement includes the third equivalent tensor arrangement and the fourth equivalent tensor arrangement.
  • In a possible implementation, the tensor map of the third equivalent tensor arrangement is inconsistent with the tensor map of the fourth equivalent tensor arrangement; the determining unit is further configured to determine one or more tensor mapping conversion operators (split, merge, or communication operators), where the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the third equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the fourth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • In a possible implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, and the tensor shape of the first tensor arrangement is inconsistent with the tensor shape of the second tensor arrangement; the determining unit is specifically configured to: determine, according to the first tensor arrangement and the second tensor arrangement, a fifth equivalent tensor arrangement equivalent to the first tensor arrangement and a sixth equivalent tensor arrangement equivalent to the second tensor arrangement, where the device matrix of the fifth equivalent tensor arrangement is consistent with the device matrix of the sixth equivalent tensor arrangement, and the tensor shape of the fifth equivalent tensor arrangement is consistent with the tensor shape of the sixth equivalent tensor arrangement. The intermediate tensor arrangement includes the fifth equivalent tensor arrangement and the sixth equivalent tensor arrangement.
  • In a possible implementation, the tensor map of the fifth equivalent tensor arrangement is inconsistent with the tensor map of the sixth equivalent tensor arrangement; the determining unit is further configured to determine one or more tensor mapping conversion operators (split, merge, or communication operators), where the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the fifth equivalent tensor arrangement, the tensor arrangement of their output tensor is consistent with the sixth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice computation graph.
  • In a possible implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, the tensor shape of the first tensor arrangement is consistent with the tensor shape of the second tensor arrangement, and the tensor map of the first tensor arrangement is inconsistent with the tensor map of the second tensor arrangement; the determining unit is specifically configured to: determine one or more tensor mapping conversion operators, where a tensor mapping conversion operator is a split operator, a merge operator, or a communication operator, and the one or more tensor mapping conversion operators take the output tensor of the first operator as input and output the input tensor of the second operator.
  • In a possible implementation, the acquiring unit is specifically configured to obtain the deep neural network model and a segmentation strategy, where the segmentation strategy includes the number of cuts of each tensor of the deep neural network model in each dimension; and the determining unit is specifically configured to determine, according to the deep neural network model and the segmentation strategy, the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator in the deep neural network model.
  • In a possible implementation, the segmentation strategy includes a first segmentation strategy and a second segmentation strategy; the determining unit is specifically configured to: determine a first overall tensor arrangement corresponding to the first segmentation strategy and a second overall tensor arrangement corresponding to the second segmentation strategy; and determine, from the first overall tensor arrangement and the second overall tensor arrangement, the first overall tensor arrangement as the tensor arrangements of the input tensor and the output tensor of each operator in the deep neural network model, where the sum of the communication time and the computation time required to train the deep neural network model based on the first overall tensor arrangement is less than the sum of the communication time and the computation time required to train it based on the second overall tensor arrangement.
  • In a possible implementation, the cost model of the first overall tensor arrangement is smaller than the cost model of the second overall tensor arrangement; the cost model of the first overall tensor arrangement is the value obtained by a weighted summation of the size of the data tensors, the size of the communication tensors, and the size of the parameter tensors in the first overall tensor arrangement, using the weight coefficient of the data tensor size, the weight coefficient of the communication tensor size, and the weight coefficient of the parameter tensor size; the cost model of the second overall tensor arrangement is the value obtained by the same weighted summation over the size of the data tensors, the size of the communication tensors, and the size of the parameter tensors in the second overall tensor arrangement.
  • In a possible implementation, the segmentation strategy is a segmentation strategy specified by the user.
  • In a possible implementation, the input tensor of each operator in the deep neural network model includes a training data set, and the training data set includes a text data set, an image data set, or an audio data set.
  • A third aspect of the embodiments of the present application provides a data processing device, including a processor and a memory that are connected to each other, where the memory is configured to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method described in any one of the foregoing first aspect and its various possible implementations.
  • A fourth aspect of the embodiments of the present application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its various possible implementations.
  • A fifth aspect of the embodiments of the present application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the method described in any one of the foregoing first aspect and its various possible implementations.
  • A sixth aspect of the embodiments of the present application provides a chip including a processor. The processor is configured to read and execute a computer program stored in a memory to execute the method in any possible implementation of any one of the foregoing aspects. Optionally, the chip includes the memory, and the memory and the processor are connected through a circuit or a wire. Further optionally, the chip includes a communication interface, and the processor is connected to the communication interface. The communication interface is configured to receive data and/or information to be processed; the processor obtains the data and/or information from the communication interface, processes the data and/or information, and outputs the processing result through the communication interface. The communication interface may be an input/output interface.
  • A seventh aspect of the embodiments of the present application provides a distributed cluster, where the distributed cluster includes one or more of the data processing devices described in the second aspect and its various possible implementations.
  • FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of this application.
  • FIG. 2 is a schematic diagram of an application environment provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of a computation graph provided by an embodiment of this application.
  • FIG. 4 is a schematic diagram of a distributed cluster topology provided by an embodiment of this application.
  • FIG. 5 is a schematic diagram of an application scenario in an embodiment of this application.
  • FIG. 6 is a schematic diagram of an embodiment of a data processing method in an embodiment of this application.
  • FIG. 7 is a schematic diagram of an embodiment of tensor arrangement in an embodiment of this application.
  • FIG. 8 is a schematic diagram of another embodiment of a data processing method in an embodiment of this application.
  • FIG. 9 is a schematic diagram of an embodiment of generating a rearrangement operator in an embodiment of this application.
  • FIG. 10 is a schematic diagram of another embodiment of generating a rearrangement operator in an embodiment of this application.
  • FIG. 11 is a schematic diagram of an embodiment of determining an overall tensor arrangement in an embodiment of this application.
  • FIG. 12 is a schematic diagram of another embodiment of determining an overall tensor arrangement in an embodiment of this application.
  • FIG. 13 is a schematic diagram of an embodiment of a data processing device in an embodiment of this application.
  • FIG. 14 is a schematic diagram of another embodiment of a data processing device in an embodiment of this application.
  • FIG. 15 is a diagram of a chip hardware structure provided by an embodiment of this application.
  • The embodiments of the present application provide a data processing method applied to a distributed cluster for training a deep neural network model; the tensor arrangement of each operator enables the parallelization of the training process of the deep neural network model.
  • Deep neural network model: also called a model, a network, or an algorithm in the embodiments of this application; it is divided into a forward computation part and a backward computation part. Forward propagation, or the forward computation part, is the computation process of the model, which produces the corresponding output for a set of inputs. Backpropagation, or the backward computation part, trains the model parameters, applying gradient descent to all parameters to minimize the model's loss function on the training data.
  • Computation graph: also called a data flow graph. Each computation in the neural network is a node of the computation graph, and the edges between nodes represent the data dependencies between inputs and outputs.
  • Slice computation graph: compared with the full computation graph, the original nodes remain unchanged, but the data volume corresponding to the edges between nodes is a portion of the complete data volume. The slice computation graph may also contain nodes used for rearrangement.
  • Tensor: an n-dimensional array, the n-dimensional extension of scalars, 1-dimensional vectors, and 2-dimensional matrices. Training data and intermediate computation results can all be regarded as tensors.
  • Tensor shape: a one-dimensional array composed of the number of elements of the tensor in each dimension; it describes the complete (unsliced) shape.
  • Device matrix: a one-dimensional array expressing the arrangement of devices. The number of array elements is the dimensionality of the device arrangement, and the product of the array elements equals the total number of devices. A device matrix is designed for each operator.
  • Tensor map: a one-dimensional array whose number of elements equals the number of elements of the tensor shape. Each value represents the mapping of the corresponding dimension split of the tensor onto the device matrix.
  • Tensor arrangement: the arrangement of a distributed tensor across the devices, expressed jointly by the tensor shape, the device matrix, and the tensor map.
  • Distributed tensor layout: the arrangement of the elements of a distributed tensor on each device; in the embodiments of this application it is also referred to as the tensor arrangement for short.
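  • A small worked example of the three components, under one assumed convention for the tensor map (conventions differ between frameworks):

```python
# Assumed convention: tensor_map[i] names the device-matrix axis that
# splits tensor dimension i (-1 = not split).
tensor_shape  = [4, 6]   # complete shape: 4 x 6 elements
device_matrix = [2, 2]   # 4 devices arranged as a 2 x 2 grid
tensor_map    = [1, 0]   # dim 0 split over device axis 1, dim 1 over axis 0

# Each device then holds a slice whose shape divides the full shape:
slice_shape = [tensor_shape[i] // device_matrix[tensor_map[i]]
               for i in range(len(tensor_shape))]
print(slice_shape)       # [2, 3]: every device stores a 2 x 3 slice
```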
  • the term "and/or" appearing in this application can be an association relationship describing associated objects, indicating that there can be three types of relationships, for example, A and/or B, which can mean: A alone exists, and A and B exist at the same time , The situation where B exists alone, where A and B can be singular or plural.
  • the character "/" in this application generally indicates that the associated objects before and after are in an "or” relationship.
  • "at least one” refers to one or more
  • “multiple” refers to two or more.
  • "The following at least one item (a)” or similar expressions refers to any combination of these items, including any combination of a single item (a) or a plurality of items (a).
  • at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple .
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • Intelligent Information Chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • Smart chips: hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs.
  • Basic platforms include distributed computing frameworks, networks, and related platform assurance and support, which can include cloud storage and computing, interconnection networks, and so on.
  • Sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data in the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as the Internet of Things data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, etc.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formal information to conduct machine thinking and solving problems based on reasoning control strategies.
  • the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, ranking, and prediction.
  • Some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, and so on.
  • the data processing method provided by the embodiments of this application can be applied to the parallel training of deep neural network models in various distributed cluster scenarios.
  • In the embodiments of this application, the segmentation strategy can be determined independently for each operator and the tensor arrangements generated accordingly; by inserting rearrangement operators, a slice computation graph of the deep neural network model that can be executed by a single data processing device is obtained.
  • an embodiment of the present application provides a system architecture 200.
  • the data collection device 260 is used to collect data and store it in the database 230, and the training device 220 generates the target model/rule 201 based on the data maintained in the database 230.
  • the data may be text data, audio data, or image data. Images include pictures and videos. The specific data type is not limited here. The following will describe in more detail how the training device 220 obtains the target model/rule 201 based on the data.
  • the target model/rule 201 can be used in application scenarios such as text translation, speech recognition, face recognition, three-dimensional reconstruction, and virtual reality.
  • the target model/rule 201 may be obtained based on a deep neural network, and the deep neural network will be introduced below.
  • The work of each layer in a deep neural network can be described by the mathematical expression y = a(Wx + b). At the physical level, the work of each layer can be understood as completing the transformation from input space to output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by Wx, operation 4 is performed by +b, and operation 5 is realized by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class of things.
  • W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network.
  • The vector W determines the spatial transformation from input space to output space described above; that is, the weight W of each layer controls how space is transformed.
  • The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • During training, the predicted value of the network is compared with the truly desired target value, and the weight vectors of each layer of the network are updated according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to lower the prediction, and adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value"; this is the role of the loss function or objective function, important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (the loss), the greater the difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
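  • A minimal sketch of the update step this describes: one plain gradient-descent step on a weight matrix, with an illustrative learning rate (the gradient here is a stand-in for the value backpropagation would produce):

```python
import numpy as np

def sgd_step(weights, grad, lr=0.01):
    # One plain gradient-descent update; lr is an illustrative value.
    return weights - lr * grad

W = np.ones((4, 4))
grad = 0.5 * np.ones((4, 4))   # stand-in for dLoss/dW from backpropagation
W = sgd_step(W, grad)          # loss decreases as W moves against the gradient
print(W[0, 0])                 # 0.995
```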
  • the target model/rule obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices.
  • the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the target model/rule 201 to process the input data. Taking three-dimensional modeling as an example, the calculation module 211 can analyze the input image or image sequence to restore the depth information of the target.
  • the correlation function module 213 can preprocess the image data in the calculation module 211.
  • the correlation function module 214 can preprocess the image data in the calculation module 211.
  • the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
  • the training device 220 can generate corresponding target models/rules 201 based on different data for different targets, so as to provide users with better results.
  • the user can manually specify the input data in the execution device 210, for example, to operate in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240.
  • the user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • the client device 240 may also serve as a data collection terminal to store the collected training data in the database 230.
  • Fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 250 is an external memory relative to the execution device 210. In other cases, the data storage system 250 may also be placed in the execution device 210.
  • the training device 220, the execution device 210, and the client device 240 are separate devices.
  • the training device 220 and the execution device 210 may be the same physical device, and the physical device can implement all the functions of the training device 220 and the execution device 210; optionally, the execution device 210 and the client device 240 may also be the same physical device, and the physical device can implement all the functions of the execution device 210 and the client device 240; optionally, the training device 220, the execution device 210, and the client device 240 may all be the same physical device, and the physical device can implement all the functions of the training device 220, the execution device 210, and the client device 240. The specific scenario architecture of the embodiment of the present application is not limited here.
  • in the existing Mesh-Tensorflow parallel solution for deep learning models, segmentation can be performed along any dimension of any tensor in the model, but the device matrices of all tensors must be the same. Since the Mesh-Tensorflow solution requires the device matrices of all tensors to be the same, tensor mapping is constrained: for example, the sample dimension of every tensor must be mapped to the same dimension of the device matrix, which limits conversion between parallel modes. For example, hybrid parallelism combining data parallelism and model parallelism cannot be realized, tensors cannot be segmented independently for each operator, and the overall communication efficiency of the parallel scheme is low.
  • FIG. 3 is a schematic diagram of a calculation graph provided by an embodiment of this application.
  • the layer in the deep neural network model can be regarded as an operator.
  • the deep neural network model includes multiple operators.
  • the first operator and the second operator are taken as examples to introduce the part of the deep neural network calculation graph.
  • the input tensor of the first operator includes the data tensor X and the parameter tensor W
  • the output tensor of the first operator is the input tensor of the second operator
  • the second operator also has another input tensor, namely the parameter tensor V
  • the output tensor of the second operator is the tensor Z.
  • the tensor of an operator includes an input tensor and an output tensor.
  • the number of tensors of the operator is not limited.
  • an operator may or may not have an input parameter tensor; this is not limited here, and Figure 3 is only one possible calculation graph.
  • FIG. 4 is a schematic diagram of a distributed cluster topology structure provided by an embodiment of the application.
  • a distributed cluster usually includes multiple servers, and each server may include multiple data processing devices.
  • the data processing devices may specifically be CPUs, GPUs, or other types of processors, such as Ascend chips, which are not specifically limited here.
  • Figure 4 illustrates a possible distributed cluster topology.
  • the distributed cluster includes n servers, each deploying 8 data processing devices, which is usually referred to as one machine with 8 cards, and the servers communicate through a switching network. It can be understood that the communication delay between servers is greater than the communication delay between the data processing devices inside a server.
  • the deep neural network can be deployed in the distributed cluster, and the deep neural network model is trained in parallel by multiple data processing devices in multiple servers.
  • Figure 5 is a schematic diagram of an application scenario in an embodiment of the application.
  • the deep neural network model is compiled through the Python front end and converted to obtain the full calculation graph executed by a single machine.
  • the parallel scheme is generated according to the data processing method provided by the embodiment of the application, and the slice calculation graph is obtained.
  • the slice calculation graph is the part used by a single data processing device to perform deep neural network model training; the slice calculation graph is compiled through automatic differentiation and graph optimization, and the execution graph can be obtained.
  • the data processing method provided by the embodiment of the present application is mainly used to generate a parallel scheme, and specifically involves an operator parallel segmentation modeling module, a cost model module, a parallel segmentation strategy module, and a graph segmentation module.
  • the operator parallel segmentation modeling module is used to generate multiple candidate segmentation strategies and the overall operator tensor arrangement of the corresponding deep neural network model.
  • the cost model module can determine the target segmentation strategy from the multiple candidate segmentation strategies according to cost.
  • the parallel segmentation strategy module is used to insert rearrangement operators.
  • the graph segmentation module performs graph division based on the updated deep neural network model with the inserted rearrangement operators and on the tensor arrangement of each operator, to obtain a slice calculation graph.
  • FIG. 6 is a schematic diagram of an embodiment of the data processing method in the embodiment of the application.
  • 601. The data processing device obtains the deep neural network model and the segmentation strategy.
  • the data processing device obtains a deep neural network model (also called a model, a network, or an algorithm in the embodiments of the present application), and determines, according to the obtained model, the full forward graph of the model executed by a single machine.
  • the data processing device obtains the segmentation strategy; the segmentation strategy refers to the segmentation method of each tensor in the model, including the number of segments of the tensor in each dimension. It should be noted that the segmentation strategy can be specified by the user or generated by the data processing device; the specific segmentation strategy is not limited here.
  • the data processing device generates multiple candidate segmentation strategies according to the model and device topology information, and determines the optimal segmentation strategy from the multiple candidate segmentation strategies. For specific implementation methods, please refer to subsequent embodiments.
  • obtaining the segmentation strategy in step 601 is an optional operation: if the data processing device generates multiple candidate segmentation strategies according to the model and device topology information, obtains multiple overall tensor arrangements from them, and determines the target overall tensor arrangement, step 603 can be performed directly.
  • 602. The data processing device determines the tensor arrangement of each operator according to the deep neural network model and the segmentation strategy;
  • a distributed tensor layout, which is the arrangement of the elements of a distributed tensor on each device, consists of a device matrix (device matrix), a tensor map (tensor map), and a tensor shape (tensor shape).
  • the distributed tensor arrangement is abbreviated as tensor arrangement.
  • the tensor shape refers to a one-dimensional array composed of the number of elements in each dimension of the tensor.
  • the device matrix is used to express the arrangement of devices; it is a one-dimensional array in which the number of array elements represents the number of dimensions of the device arrangement and the product of the array elements equals the total number of devices.
  • the tensor map is a one-dimensional array whose number of elements equals the number of elements of the tensor shape; the value of each element represents the mapping of the corresponding tensor dimension onto the device matrix.
  • the expression of tensor arrangement can support the needs of tensor arrangement of operators in various parallel modes.
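  • As a hedged illustration of the layout structure described above (the field and class names are ours, not from the embodiment), a distributed tensor layout can be represented as:

    from dataclasses import dataclass
    from math import prod
    from typing import List

    @dataclass
    class TensorLayout:
        device_matrix: List[int]  # device arrangement, e.g. [3, 2, 4] for 24 devices
        tensor_map: List[int]     # device-matrix dimension for each tensor dimension (-1: not split)
        tensor_shape: List[int]   # number of elements in each tensor dimension

        def validate(self, total_devices: int) -> None:
            assert prod(self.device_matrix) == total_devices, \
                "the product of the device matrix elements must equal the device count"
            assert len(self.tensor_map) == len(self.tensor_shape), \
                "tensor map and tensor shape must have the same number of elements"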
  • the data processing device determines the tensor arrangement of each operator, and then obtains the tensor slice of the data processing device to obtain the slice calculation graph.
  • the data processing device can determine the tensor arrangement of each operator as follows: according to the deep neural network model, it determines the tensors of all operators, including the number of elements of each tensor in each dimension (that is, the tensor shape); then, according to the segmentation strategy (that is, the number of segments of each tensor in each dimension) and the device topology of the distributed cluster (including the number of servers, the number of data processing devices in each server, their connection relationships, and so on), it can determine the device matrix and the tensor map of each operator's tensors.
  • the data processing device can generate the tensor arrangement of each operator according to preset rules from the segmentation strategy specified by the user, or it can generate multiple overall tensor arrangements according to the user-specified segmentation strategy, where an overall tensor arrangement is the tensor arrangement of every operator in the deep neural network model, and then determine one optimal overall tensor arrangement from the multiple overall tensor arrangements.
  • the data processing device may determine the optimal overall tensor arrangement corresponding to the segmentation strategy. For details, please refer to the subsequent embodiments.
  • FIG. 7 is a schematic diagram of an embodiment of tensor arrangement in the embodiment of this application.
  • the input tensors of the matrix multiplication operator are tensor A and tensor B, and the output tensor is tensor C.
  • Table 1 in Figure 7 shows the tensor arrangements of tensor A, tensor B, and tensor C. It can be seen that the device matrices of the three tensors are all [3, 2, 4]. The tensor shape of tensor A is [hA, wA] and its tensor map is [2, 1], meaning that dimension 0 of tensor A (hA) is mapped to dimension 2 of the device matrix, and dimension 1 (wA) is mapped to dimension 1 of the device matrix.
  • the tensor shape of tensor B is [hB, wB]
  • the tensor map is [1, 0]
  • the tensor shape of tensor C is [hC, wC]
  • the tensor map is [2, 0].
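  • Reading the device matrix [3, 2, 4] as [D2, D1, D0], the tensor map [2, 1] of tensor A means hA is split into D2 = 3 parts and wA into D1 = 2 parts. The following illustrative helper (our own code, not from the embodiment) computes the per-device slice shape:

    def slice_shape(tensor_shape, tensor_map, device_matrix):
        dims = len(device_matrix)
        result = []
        for size, mapped in zip(tensor_shape, tensor_map):
            if mapped == -1:  # -1 means this dimension is not split
                result.append(size)
            else:             # map value k refers to D_k, i.e. device_matrix[dims - 1 - k]
                result.append(size // device_matrix[dims - 1 - mapped])
        return result

    print(slice_shape([6, 8], [2, 1], [3, 2, 4]))  # -> [2, 4]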
  • steps 601 to 602 are optional steps: before step 603, the data processing device can directly obtain the deep neural network model together with the tensor arrangement of the input tensor of each operator in the deep neural network model.
  • according to the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator in the deep neural network model, the tensor slice of the data processing device can be obtained to get the slice calculation graph.
  • 603. The data processing device inserts a rearrangement operator between consecutive operators with different tensor arrangements.
  • if the tensor arrangement of the output tensor of the first operator is inconsistent with the tensor arrangement of the input tensor of the second operator, the data processing device cannot execute the tensor slice determined in step 602;
  • therefore, the embodiment of the present application inserts a redistribution (rearrangement) operator between the first operator and the second operator.
  • Inconsistent tensor arrangement means that at least one of the device matrix, tensor map, and tensor shape is different.
  • the data processing device performs device matrix normalization to determine the first equivalent tensor arrangement of tensor_layout_from (tensor_layout_from2) and the second equivalent tensor arrangement of tensor_layout_to (tensor_layout_to2); the device matrix of tensor_layout_from2 is consistent with the device matrix of tensor_layout_to2, and conversion between equivalent tensor arrangements can be realized by a reshape operator, for example, tensor_layout_from is converted to tensor_layout_from2 by a reshape operator.
  • the data processing device performs tensor shape normalization to determine the third equivalent tensor arrangement of tensor_layout_from (tensor_layout_from3) and the fourth equivalent tensor arrangement of tensor_layout_to (tensor_layout_to3); the tensor shape of tensor_layout_from3 is consistent with the tensor shape of tensor_layout_to3, and conversion between equivalent tensor arrangements can be realized by a reshape operator, for example, tensor_layout_from2 is converted to tensor_layout_from3 through the reshape operator.
  • if the device matrices of tensor_layout_from and tensor_layout_to are the same, their tensor shapes are the same, and their tensor mappings are inconsistent; or the device matrices of tensor_layout_from2 and tensor_layout_to2 are the same, their tensor shapes are the same, and their tensor mappings are inconsistent; or the device matrices of tensor_layout_from3 and tensor_layout_to3 are the same, their tensor shapes are the same, and their tensor mappings are inconsistent; then tensor mapping conversion operators, such as slice operators, concat operators, or communication operators, are inserted between the tensors with inconsistent tensor mappings.
  • Communication operators include alltoall operators and allgather operators.
  • through the tensor mapping conversion operators, conversion between tensor arrangements whose tensor mappings differ can be realized, for example, the conversion from tensor_layout_from3 to tensor_layout_to3. It is understandable that the tensor arrangement of each operator can be obtained when generating the communication operators or segmentation operators.
  • the data processing device will generate one or more of the reshape operator, slice operator, concat operator, alltoall operator, and allgather operator; that is, the redistribution operator can include a single operator or a rearrangement operator sequence composed of multiple operators, and the specific number and types of operators are not limited here.
  • there may be multiple pairs of first operators and second operators in the deep neural network model; inserting rearrangement operators between each pair of first and second operators determines the executable updated deep neural network model.
  • if the segmentation strategy in step 601 was determined by the data processing device from multiple candidate segmentation strategies, the data processing device can directly obtain which rearrangement operators need to be inserted between the first operator and the second operator; in this step, the rearrangement operators can be inserted between the first operator and the second operator in sequence order.
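  • A minimal sketch of this insertion pass (the operator attributes and helper names are assumptions, not from the embodiment): walk consecutive operator pairs and splice in a rearrangement operator sequence wherever the layouts differ:

    def insert_redistribution(operators, gen_rearrangement_ops):
        # operators: list of ops with .output_layout / .input_layout attributes (assumed)
        updated = [operators[0]]
        for first_op, second_op in zip(operators, operators[1:]):
            if first_op.output_layout != second_op.input_layout:
                # e.g. returns [reshape, slice, alltoall, ...] for this pair
                updated.extend(gen_rearrangement_ops(first_op.output_layout,
                                                     second_op.input_layout))
            updated.append(second_op)
        return updated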
  • 604. The data processing device updates the slice calculation graph according to the rearrangement operator, and determines the updated slice calculation graph.
  • the data processing device updates the slice calculation graph according to the rearrangement operator, and then generates the slice execution graph according to the graph compilation process, through automatic differentiation, graph optimization and other processes.
  • a single data processing device executes part of the deep neural network model training process based on its corresponding slice execution graph.
  • each data processing device in the distributed cluster can obtain a slice calculation graph of the data processing device, generate an execution graph from the slice calculation graph, and execute a part of the deep neural network model training process.
  • the data processing device obtains the deep neural network model and the tensor arrangements of all operators therein, including the tensor arrangement of each input tensor and the tensor arrangement of each output tensor. If, for two consecutive operators, the tensor arrangement of the output tensor of the first operator is inconsistent with the tensor arrangement of the input tensor of the second operator, the slice calculation graph computed from these tensor arrangements cannot be executed.
  • the data processing device in this application therefore determines a rearrangement operator between the first operator and the second operator to convert the tensor arrangement of the output tensor of the first operator into the tensor arrangement of the input tensor of the second operator; thus, the rearrangement operator is inserted into the slice calculation graph, and the determined updated slice calculation graph can be executed.
  • This solution can be applied to any given operator tensor arrangement, and the parallel mode conversion is realized by rearranging the operator, so that all kinds of hybrid parallelism can be realized in a distributed cluster.
  • the data processing method provided by the embodiment of the present application supports a flexible tensor segmentation parallel mode.
  • Each operator is independently modeled and can be divided in all dimensions of the tensor.
  • the tensor arrangement is converted between operators by inserting rearrangement operator sequences.
  • the existing Mesh-Tensorflow method models the entire deep neural network and cannot support the conversion of parallel modes between operators.
  • hybrid parallel mode 1: channel model parallelism converted to batch data parallelism, which is a common parallel training method for the deep learning recommendation model (DLRM).
  • the Mesh-Tensorflow method cannot support free switching of parallel modes between operators, but this application can generate any rearrangement operator required for tensor arrangement conversion through the rearrangement operator generating device, and supports flexible segmentation strategy configuration at the operator level.
  • hybrid parallel mode 2: data parallelism overlaid with model parallelism, which is a commonly used parallel mode in Transformer networks.
  • Existing solutions cannot support full-dimensional segmentation of tensors, and therefore cannot support such segmentation scenarios. This application can support the segmentation of each dimension of the tensor.
  • FIG. 8 is a schematic diagram of another embodiment of the data processing method in the embodiment of this application.
  • FIG. 8 illustrates the overall flow of the data processing method in the embodiment of the present application.
  • the network model is searched through the cost model for a parallel strategy, where the parallel strategy can mix data parallelism and model parallelism. After multiple slice calculation graphs are obtained according to the parallel strategy, they are executed by multiple data processing devices in the distributed cluster through the communication network.
  • FIG. 9 is a schematic diagram of an embodiment of generating a rearrangement operator in an embodiment of this application.
  • the first operator and the second operator are two consecutive operators in the full forward graph of the deep neural network model; if the tensor layout of the output tensor of the first operator (tensor_layout_from) and the tensor layout of the input tensor of the second operator (tensor_layout_to) are inconsistent, the deep neural network model cannot be segmented and executed according to the tensor layout determined in step 602.
  • therefore, the embodiment of the present application inserts a redistribution operator between the first operator and the second operator to update the slice calculation graph, so that the updated slice calculation graph is executable.
  • Inconsistent tensor arrangement means that at least one of the device matrix, tensor map, and tensor shape is different.
  • if the device matrices of tensor_layout_from and tensor_layout_to are inconsistent, perform steps 901 to 905;
  • if the device matrices of tensor_layout_from and tensor_layout_to are consistent but the tensor shapes are inconsistent, perform steps 902 to 905;
  • if the device matrices of tensor_layout_from and tensor_layout_to are consistent, the tensor shapes are consistent, and the tensor mappings are inconsistent, perform steps 903 to 905;
  • 901. The data processing device determines the equivalent arrangement tensor_layout_from2 of tensor_layout_from under the expanded device matrix, and the equivalent arrangement tensor_layout_to2 of tensor_layout_to under the expanded device matrix;
  • the data processing device determines the first equivalent tensor arrangement tensor_layout_from2 equivalent to tensor_layout_from and the second equivalent tensor arrangement tensor_layout_to2 equivalent to tensor_layout_to according to the device matrix of tensor_layout_from and the device matrix of tensor_layout_to.
  • an implementation method for determining tensor_layout_from2 and tensor_layout_to2 is as follows:
  • 1) Since the device matrix of tensor_layout_from differs from the device matrix of tensor_layout_to, the data processing device determines, according to the first device matrix of tensor_layout_from (device_matrix_from) [A0, A1, ..., An] and the second device matrix of tensor_layout_to (device_matrix_to) [B0, B1, ..., Bm], the cumulative device matrix of the first device matrix and the cumulative device matrix of the second device matrix.
  • the cumulative device matrix of the first device matrix [A0, A1, ..., An] is [A0·A1·...·An, A1·...·An, ..., An];
  • the cumulative device matrix of the second device matrix [B0, B1, ..., Bm] is [B0·B1·...·Bm, B1·...·Bm, ..., Bm].
  • taking the union device_matrix_equal2_accum of the cumulative device matrix of the first device matrix and the cumulative device matrix of the second device matrix yields the minimum cumulative normalized expanded device matrix [C0·C1·...·Ck, C1·...·Ck, ..., Ck]; k is a positive integer greater than or equal to 1, and C0·C1·...·Ck, C1·...·Ck, ..., Ck are its elements. The number of elements equals k+1; specifically, it is the number of elements in the union of the two cumulative device matrices after repeated elements are removed.
  • from the minimum cumulative normalized expanded device matrix [C0·C1·...·Ck, C1·...·Ck, ..., Ck], the expanded device matrix [C0, C1, ..., Ck] is obtained.
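  • Under our reading of this construction, and assuming each element of the union divides the next larger one (as in the example below), the expanded device matrix can be recovered from consecutive ratios of the cumulative values; a hedged Python sketch:

    from math import prod

    def accum(device_matrix):
        # [A0, A1, ..., An] -> [A0*A1*...*An, A1*...*An, ..., An]
        return [prod(device_matrix[i:]) for i in range(len(device_matrix))]

    def expanded_device_matrix(dm_from, dm_to):
        union = sorted(set(accum(dm_from)) | set(accum(dm_to)), reverse=True)
        # consecutive ratios of the cumulative values give the expanded matrix
        return [union[i] // union[i + 1] for i in range(len(union) - 1)] + [union[-1]]

    print(expanded_device_matrix([4, 2], [2, 4]))  # -> [2, 2, 2]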
  • determining the equivalent tensor layout of the first tensor layout (tensor_layout_from) according to the minimum normalized expanded device matrix includes:
  • tensor_shape = [s[N-1], s[N-2], ..., s[0]]; s[N-1], s[N-2], ..., s[0] are the elements of the tensor shape.
  • device_matrix = [d[D-1], d[D-2], ..., d[0]]; d[D-1], d[D-2], ..., d[0] are the elements of the device matrix.
  • tensor_map = [m[N-1], m[N-2], ..., m[0]]; m[N-1], m[N-2], ..., m[0] are the elements of the tensor map.
  • device_matrix_e = [d[D-1], d[D-2], ..., d[i+1], m, n, ..., d[0]]; that is, the device dimension d[i] is expanded into the two factors m and n.
  • tensor_map_e = [me[N], me[N-1], ..., me[k+1], me[k], ..., me[0]];
  • tensor_shape_e = [se[N-1], se[N-2], ..., se[k+1], se[k], ..., se[0]]; se[N-1], se[N-2], ..., se[k+1], se[k], ..., se[0] are the elements of tensor_shape_e.
  • the conversion between tensor_layout_from and tensor_layout_from2 can be realized by the reshape operator, and the conversion between tensor_layout_to and tensor_layout_to2 can also be realized by the reshape operator.
  • 902. The data processing device performs tensor shape normalization according to tensor_layout_from2 (the first equivalent tensor layout) and tensor_layout_to2 (the second equivalent tensor layout), and determines a third equivalent tensor arrangement (tensor_layout_from3) equivalent to tensor_layout_from2 and a fourth equivalent tensor arrangement (tensor_layout_to3) equivalent to tensor_layout_to2;
  • tensor_shape_e = [s[N-1], s[N-2], ..., s[i+1], m, n, s[i-1], ..., s[0]]; that is, the tensor dimension s[i] is split into the two factors m and n.
  • device_matrix_e = [de[D], de[D-1], ..., de[0]];
  • tensor_map_e = [me[N], me[N-1], ..., me[i+1], me[i], ..., me[0]].
  • the conversion between tensor_layout_from2 and tensor_layout_from3 can be realized by the reshape operator, and the conversion between tensor_layout_to2 and tensor_layout_to3 can also be realized by the reshape operator.
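  • For illustration (our own helper, not from the embodiment), tensor shape normalization splits one dimension of the tensor shape into two factors, which a reshape operator can realize:

    def split_dim(tensor_shape, i, m):
        # split dimension i of size s into (m, s // m); s must be divisible by m
        s = tensor_shape[i]
        assert s % m == 0, "dimension must be divisible by the split factor"
        return tensor_shape[:i] + [m, s // m] + tensor_shape[i + 1:]

    print(split_dim([8, 6], 0, 2))  # -> [2, 4, 6]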
  • before performing step 903, first determine whether the tensor mappings of tensor_layout_from3 and tensor_layout_to3 are consistent. If they are consistent, the shapes of the tensor slices are the same and step 903 is not performed; if they are inconsistent, the tensor mapping needs to be converted, that is, a tensor mapping conversion operator is inserted and step 903 is executed;
  • if, after step 901, the tensor shapes are already consistent, step 902 may be skipped and step 903 performed directly; in this case tensor_layout_from3 is equal to tensor_layout_from2 and tensor_layout_to3 is equal to tensor_layout_to2.
  • if the device matrices and tensor shapes of tensor_layout_from and tensor_layout_to are both consistent, steps 901 to 902 may be skipped and step 903 performed directly; in this case tensor_layout_from3 is equal to tensor_layout_from and tensor_layout_to3 is equal to tensor_layout_to.
  • 903. By inserting tensor mapping conversion operators, conversion between tensor arrangements whose tensor mappings differ can be realized.
  • the data processing device will generate one or more of the reshape operator, the slice operator, the alltoall operator, and the allgather operator.
  • for example, tensor_map_from3 is (-1, -1, -1, 3, 2, 1, 0) and is converted to tensor_map_to3 (4, -1, -1, 3, 2, 1, 0): the element of dimension 0 changes from -1 to 4, so a slice operator is inserted to split dimension 0 of the tensor.
  • the tensor map is then converted from (4, -1, -1, 3, 2, 1, 0) to (4, 3, -1, -1, 2, 1, 0): the elements of dimension 1 and dimension 3 are exchanged, so an alltoall operator is inserted.
  • in this way, tensor_layout_from3 can be converted to tensor_layout_to3.
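  • Following the two examples above, a hedged sketch of choosing the tensor mapping conversion operator per dimension (the dispatch rules beyond those two examples are our assumption):

    def mapping_conversion_ops(map_from, map_to):
        ops = []
        for dim, (src, dst) in enumerate(zip(map_from, map_to)):
            if src == dst:
                continue
            if src == -1:          # unsplit -> split: insert a slice operator
                ops.append(("slice", dim, dst))
            elif dst == -1:        # split -> unsplit: insert an allgather operator
                ops.append(("allgather", dim, src))
            else:                  # exchange of mapped device dimensions: alltoall
                ops.append(("alltoall", dim, src, dst))
        return ops

    print(mapping_conversion_ops((-1, -1, -1, 3, 2, 1, 0),
                                 (4, -1, -1, 3, 2, 1, 0)))  # -> [('slice', 0, 4)]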
  • 904. The reshape operators are determined.
  • 905. The rearrangement sequence can be determined; the rearrangement sequence is the sequence of operators inserted, in order, between the first operator and the second operator.
  • the rearrangement sequence includes one or more operators; the specific number is not limited here.
  • the rearrangement sequence includes all rearrangement operators determined in steps 901 to 904, including one or more of the segmentation operator, the merge operator, the reshape operator, and the communication operator.
  • when the first operator and the second operator are two consecutive operators in the full forward graph of the deep neural network model and the tensor layout of the output tensor of the first operator (tensor_layout_from) is inconsistent with the tensor layout of the input tensor of the second operator (tensor_layout_to), the rearrangement sequence converts the tensor layout of the output tensor of the first operator (tensor_layout_from) into the tensor layout of the input tensor of the second operator (tensor_layout_to).
  • the rearrangement operator generating device provided by the embodiment of the present application can generate the rearrangement operators required to convert between any two tensor arrangements. Based on the tensor layout expression, a rearrangement operator sequence is generated for any tensor layout conversion, and the generated operator sequence has minimal overhead.
  • FIG. 10 is a schematic diagram of another embodiment of generating a rearrangement operator in an embodiment of the application.
  • the embodiment of the application generates a rearrangement operator according to tensor_layout_from and tensor_layout_to, and updates the deep neural network model by inserting the redistribution operator between the first operator and the second operator, so that the updated deep neural network model is executable.
  • rearrangement operators include reshape operators, communication operators, segmentation operators, and merge operators; the rearrangement operators inserted between the first operator and the second operator can include one or more of these types, and the specific types and numbers of operators are not limited.
  • the data processing device determines the optimal segmentation strategy from multiple candidate segmentation strategies and uses the overall tensor arrangement corresponding to the optimal segmentation strategy to generate the slice calculation graph; alternatively, based on the multiple candidate segmentation strategies, it obtains the overall tensor arrangement corresponding to each candidate segmentation strategy, yielding multiple overall tensor arrangements, and determines the optimal overall tensor arrangement from among them to generate the slice calculation graph.
  • FIG. 11 is a schematic diagram of an embodiment of determining the overall tensor arrangement in an embodiment of the application.
  • the data processing device can determine all the tensors in the model according to the deep neural network model, and then can determine the tensor shape according to the number of elements of the tensor in each dimension.
  • the data processing device can also obtain the device topology information of the distributed cluster, and the cluster resource distribution for the deep neural network model, including the number of servers, the number of data processing devices on each server, and the connection relationship between the servers.
  • each data processing device may be, for example, a GPU, a CPU, or another type of processor; the specific type is not limited here.
  • the topology information can be used to obtain the total number of data processing devices in the distributed cluster. The total number can be used to constrain the device matrix in the tensor arrangement.
  • the data processing device determines the tensors of the first operator according to the full forward graph; the tensors of the first operator include input tensors and an output tensor. According to the number of devices and the preset rules, the data processing device determines the same device matrix for the input tensors and output tensors of each operator.
  • according to the tensor shape of each tensor and the device topology information, different candidate segmentation strategies can be determined by traversal, that is, the segmentation method of each tensor, including the number of segments of the tensor in each dimension.
  • determining the mapping of the slices of all dimensions of all tensors in the deep neural network model onto the device cluster, that is, determining the device matrix and tensor mapping of each tensor under a segmentation strategy, yields multiple overall tensor arrangements of the deep neural network model; an overall tensor arrangement refers to the tensor arrangements of the input tensors and the output tensors of all operators in the deep neural network model.
  • different preset rules are determined according to the operator type, so as to determine an appropriate device matrix for the operator's tensors.
  • different types of operators include, for example, matrix multiplication operators, tensor addition operators, convolution operators, and the softmax operator.
  • compared with other dimensions, the batch dimension of a tensor is insensitive to communication delay and bandwidth.
  • the network communication bandwidth between multiple GPUs inside a server is high and the communication delay is relatively low, while the communication delay between servers is relatively high. Therefore, the batch dimension is split between nodes in parallel, and the model dimensions are split inside the nodes.
  • the device matrix is [D2, D1, D0].
  • when mapping a tensor to the device matrix, the dimensions of the tensor are identified, and the batch dimension of the tensor is implicitly mapped to the inter-node D0 dimension of the device matrix, while the non-batch dimensions of the tensor, that is, the model dimensions, are mapped to the D1 or D2 dimension.
  • FIG. 12 is a schematic diagram of another embodiment for determining the overall tensor arrangement in an embodiment of this application.
  • Figure 12 shows the parallel modeling process of a two-dimensional matrix multiplication operator, with 16 devices and 4 cards per machine.
  • the input tensors of the operator are a two-dimensional tensor M and a two-dimensional tensor N, and the output tensor is a two-dimensional tensor Q.
  • the fourth server has four data processing devices: E1, E2, E3, and E4.
  • each element in the device matrix, that is, each cube in the figure, corresponds to one data processing device; together, the elements of the device matrix represent all the data processing devices in the distributed cluster used to execute the deep neural network model in this application.
  • the device matrix [D2, D1, D0] in the figure is [2, 2, 4].
  • the tensor M is mapped to the D0-D1 plane. Assuming the rows of tensor M are the sample dimension, the rows are mapped to the D0 dimension. The sample dimension has lower communication requirements; correspondingly, the data processing devices between servers are arranged along the D0 dimension, and the other dimensions arrange the devices inside a server. For example, A1, B1, C1, and E1 are data processing devices in different servers, so they are arranged along the D0 axis of the device matrix; A1 and A2 are two data processing devices in the same server, so they are arranged along the D1 or D2 axis.
  • this solution is designed for the different bandwidths between devices; the cluster topology relationship is expressed through the device matrix, which can flexibly adapt to various hierarchical combined network topologies and reduce communication delay.
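  • As an illustration of how the device matrix encodes the topology (the rank-ordering convention here is our assumption), a device's rank can be unraveled into coordinates [d2, d1, d0] of the device matrix [2, 2, 4], so that devices differing only in the D0 coordinate sit in different servers:

    def device_coordinates(rank, device_matrix):
        coords = []
        for size in reversed(device_matrix):  # consume D0 first, then D1, then D2
            coords.append(rank % size)
            rank //= size
        return list(reversed(coords))         # -> [d2, d1, d0]

    print(device_coordinates(5, [2, 2, 4]))   # -> [0, 1, 1]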
  • a candidate segmentation strategy is selected from the multiple candidate segmentation strategies, the tensor arrangement of each operator in the deep neural network model is obtained according to that candidate segmentation strategy, and rearrangement operators are generated between consecutive operators whose tensor arrangements differ.
  • in this way, the tensor arrangements of all the operators of the deep neural network model can be obtained, referred to hereinafter as the overall tensor arrangement. It is understandable that multiple overall tensor arrangements can be obtained according to the multiple candidate segmentation strategies.
  • the cost model is calculated.
  • the cost model considers the costs of both operators and rearrangement operators. Specifically, the storage cost, calculation cost, and communication cost of an operator are approximated by the tensor shape; the ratio of calculation and communication overhead in the loss function is controlled by weighting coefficients to adapt to different equipment platforms. From the multiple candidate segmentation strategies, the candidate segmentation strategy with the smallest cost model is determined as the target segmentation strategy, which is used to perform graph segmentation.
  • an optimal overall tensor arrangement is determined from a plurality of overall tensor arrangements, and used to implement the embodiment corresponding to FIG. 6.
  • an optimal segmentation strategy is determined according to a plurality of overall tensor arrangements, and an overall tensor arrangement is generated according to preset rules according to the optimal segmentation strategy, which is used to implement the embodiment corresponding to FIG. 6.
  • a complete single-machine forward calculation graph is input, a slice calculation forward graph is output, and rearrangement operators are inserted into the slice forward graph.
  • the output slice calculation forward graph generates a reverse calculation graph through automatic differentiation.
  • among the multiple candidate segmentation strategies, the candidate segmentation strategy with the smallest loss is determined through the cost model. The loss is the value obtained by a weighted summation of the size of the data tensor, the size of the communication tensor, and the size of the parameter tensor in the overall tensor arrangement with their respective weight coefficients, i.e. loss = alpha × (data tensor size) + beta × (communication tensor size) + gamma × (parameter tensor size). The size of the data tensor, the size of the communication tensor, and the size of the parameter tensor are respectively the storage space of the data tensor, the storage space of the communication tensor, and the storage space of the parameter tensor required to execute the deep neural network model based on the first candidate segmentation strategy.
  • alpha represents the weight coefficient of the size of the data tensor
  • beta represents the weight coefficient of the communication tensor size
  • gamma represents the weight coefficient of the parameter tensor size.
  • the values of alpha, beta, and gamma are not limited here and can be flexibly set.
  • the operators considered in the cost model include the rearrangement operator sequences: different segmentation strategies yield different operator costs, and the costs of the rearrangement operator sequences between operators also differ.
  • the overhead here includes storage and communication overhead. The upper limit of the storage overhead is set according to the memory size of the actual device, and the operator segmentation strategy combination with the smallest computational overhead is found.
  • the shapes of the input tensors of the forward operator and the reverse operator are used to approximate the storage and calculation overhead. The ratio of alpha, beta, and gamma is adjusted to adapt to different hardware platforms; for example, increasing beta increases the proportion of communication overhead and reduces the amount of communication required by the searched strategy.
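  • A minimal sketch of this weighted cost used to rank candidate strategies (the function name and the sample numbers are illustrative, not from the embodiment):

    def strategy_cost(data_size, comm_size, param_size,
                      alpha=1.0, beta=1.0, gamma=1.0):
        # sizes are approximated from tensor shapes; raising beta makes the
        # search favor strategies that communicate less
        return alpha * data_size + beta * comm_size + gamma * param_size

    candidates = {"s1": (4e6, 1e6, 2e6), "s2": (4e6, 3e6, 1e6)}
    best = min(candidates, key=lambda k: strategy_cost(*candidates[k]))
    print(best)  # -> 's1'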
  • the method for determining the overall tensor arrangement models each operator independently, configuring the tensor arrangement of the operator's input tensors and the tensor arrangement of its output tensors; between operators, the rearrangement operator generation module generates the rearrangement operators required for tensor arrangement conversion.
  • the cost model considers the operator and rearrangement operator costs at the same time, and approximates the storage, calculation, and communication costs of the operator with a tensor shape; the weighted coefficient controls the ratio of calculation and communication costs in the loss function to adapt to different device platforms.
  • this solution can input a complete single-machine forward calculation graph, output a slice calculation forward graph, and insert other operators such as rearrangement and AllReduce into the slice forward graph.
  • the output slice calculation forward graph generates a reverse calculation graph through automatic differentiation.
  • the method for determining the overall tensor arrangement provided by the embodiment of the present application, combined with the overall automatic parallel process provided by the embodiments corresponding to FIG. 6 and FIG. 9, offers flexible parallel mode configuration: each operator is independently modeled, and the rearrangement operator sequences inserted between operators can adapt to the hybrid parallel requirements of various common networks, making up for the lack of support for such network types in the prior art.
  • the cost model considers the costs of operators and rearrangement at the same time, and can search for the parallel scheme with the smallest overall cost. Using the tensor shape to approximate the calculation cost approximates the actual cost well without extensive testing, making the method platform-independent.
  • the proportion of communication and calculation overhead is controlled by weighting coefficients to adapt to different equipment platforms.
  • Tensor rearrangement supports the conversion of any tensor layout, and the conversion overhead is small.
  • the support of arbitrary tensor arrangement conversion allows operators to be independently and flexibly modeled.
  • tensor rearrangement, in addition to arrangement conversion between operators, can also be used for the distributed implementation of reshape operators.
  • for the topology-aware scheduling involved in this application, by adjusting the device matrix in the tensor layout, batch dimensions that are not sensitive to communication delay and bandwidth are parallelized between servers, while model segmentation is placed inside servers. Through the device matrix configuration, this adapts simply and flexibly to different cluster topologies.
  • FIG. 13 is a schematic diagram of an embodiment of a data processing device in an embodiment of this application.
  • the data processing device provided in the embodiment of the present application is applied to a distributed cluster, and the device includes:
  • the acquiring unit 1301 is configured to acquire a deep neural network model, and the tensor arrangement of the input tensor of each operator in the deep neural network model and the tensor arrangement of the output tensor of each operator,
  • the tensor arrangement includes a device matrix, a tensor map, and a tensor shape
  • the deep neural network model includes a first operator and a second operator, which are two consecutive operators in the deep neural network model;
  • the output tensor of the first operator is the input tensor of the second operator, and the first tensor arrangement is inconsistent with the second tensor arrangement, wherein the first tensor arrangement is the tensor arrangement of the output tensor of the first operator and the second tensor arrangement is the tensor arrangement of the input tensor of the second operator;
  • the determining unit 1302 is configured to determine the slice calculation graph according to the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator;
  • the determining unit 1302 is further configured to determine a rearrangement operator between the first operator and the second operator, and the rearrangement operator is used to convert the first tensor arrangement into the second tensor arrangement;
  • the determining unit 1302 is further configured to insert the rearrangement operator into the slice calculation graph to determine an updated slice calculation graph, and the updated slice calculation graph is used to instruct execution of part of the deep neural network model.
  • in one implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement, and/or the tensor mapping of the first tensor arrangement is inconsistent with the tensor mapping of the second tensor arrangement;
  • the determining unit 1302 is specifically configured to: determine a first reshaping operator according to an intermediate tensor arrangement and the tensor arrangement of the output tensor of the first operator, where the first reshaping operator is used to implement the conversion from the first tensor arrangement to the intermediate tensor arrangement; and/or,
  • in one implementation, the device matrix of the first tensor arrangement is inconsistent with the device matrix of the second tensor arrangement;
  • the determining unit 1302 is specifically configured to: perform device matrix normalization according to the first tensor arrangement and the second tensor arrangement to obtain an expanded device matrix, and determine a first equivalent tensor arrangement equivalent to the first tensor arrangement and a second equivalent tensor arrangement equivalent to the second tensor arrangement, where the device matrix of the first equivalent tensor arrangement is consistent with the device matrix of the second equivalent tensor arrangement;
  • the intermediate tensor arrangement includes the first equivalent tensor arrangement and the second equivalent tensor arrangement; the tensor shape is the number of elements in each dimension of the tensor.
  • in one implementation, the tensor shape of the first equivalent tensor arrangement is consistent with the tensor shape of the second equivalent tensor arrangement, and the tensor map of the first equivalent tensor arrangement is inconsistent with the tensor map of the second equivalent tensor arrangement;
  • the determining unit 1302 is further configured to determine one or more tensor mapping conversion operators, where the tensor mapping conversion operators include segmentation operators, merging operators, or communication operators; the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the first equivalent tensor arrangement, the tensor arrangement of the output tensor of the one or more tensor mapping conversion operators is consistent with the second equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice calculation graph.
  • the determining unit 1302 is further configured to perform tensor shape normalization on the first equivalent tensor arrangement and the second equivalent tensor arrangement, and determine a third equivalent tensor arrangement equivalent to the first equivalent tensor arrangement and a fourth equivalent tensor arrangement equivalent to the second equivalent tensor arrangement, where the device matrix of the third equivalent tensor arrangement is consistent with the device matrix of the fourth equivalent tensor arrangement, and the tensor shape of the third equivalent tensor arrangement is consistent with the tensor shape of the fourth equivalent tensor arrangement; the intermediate tensor arrangement includes the third equivalent tensor arrangement and the fourth equivalent tensor arrangement.
  • in one implementation, the tensor map of the third equivalent tensor arrangement is inconsistent with the tensor map of the fourth equivalent tensor arrangement;
  • the determining unit 1302 is further configured to determine one or more tensor mapping conversion operators, where the tensor mapping conversion operators include segmentation operators, merging operators, or communication operators; the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the third equivalent tensor arrangement, the tensor arrangement of the output tensor of the one or more tensor mapping conversion operators is consistent with the fourth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice calculation graph.
  • in one implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, and the tensor shape of the first tensor arrangement is inconsistent with the tensor shape of the second tensor arrangement;
  • the determining unit 1302 is specifically configured to: determine, according to the first tensor arrangement and the second tensor arrangement, a fifth equivalent tensor arrangement equivalent to the first tensor arrangement and a sixth equivalent tensor arrangement equivalent to the second tensor arrangement, where the device matrix of the fifth equivalent tensor arrangement is consistent with the device matrix of the sixth equivalent tensor arrangement, and the tensor shape of the fifth equivalent tensor arrangement is consistent with the tensor shape of the sixth equivalent tensor arrangement;
  • the intermediate tensor arrangement includes the fifth equivalent tensor arrangement and the sixth equivalent tensor arrangement.
  • in one implementation, the tensor map of the fifth equivalent tensor arrangement is inconsistent with the tensor map of the sixth equivalent tensor arrangement;
  • the determining unit 1302 is further configured to determine one or more tensor mapping conversion operators, where the tensor mapping conversion operators include segmentation operators, merging operators, or communication operators; the tensor arrangement of the input tensor of the one or more tensor mapping conversion operators is consistent with the fifth equivalent tensor arrangement, the tensor arrangement of the output tensor of the one or more tensor mapping conversion operators is consistent with the sixth equivalent tensor arrangement, and the one or more tensor mapping conversion operators are used to determine the updated slice calculation graph.
  • in one implementation, the device matrix of the first tensor arrangement is consistent with the device matrix of the second tensor arrangement, the tensor shape of the first tensor arrangement is consistent with the tensor shape of the second tensor arrangement, and the tensor map of the first tensor arrangement is inconsistent with the tensor map of the second tensor arrangement;
  • the determining unit 1302 is specifically configured to: determine one or more tensor mapping conversion operators, where the tensor mapping conversion operators include splitting operators, merging operators, or communication operators, and the one or more tensor mapping conversion operators are used to take the output tensor of the first operator as input and output the input tensor of the second operator.
  • the acquiring unit 1301 is specifically configured to: acquire the deep neural network model and the segmentation strategy;
  • the determining module is specifically configured to determine the tensor arrangement of the input tensor of each operator in the deep neural network model and the tensor arrangement of the output tensor of each operator according to the deep neural network model and the segmentation strategy.
  • the segmentation strategy includes a first segmentation strategy and a second segmentation strategy
  • the determining unit 1302 is specifically configured to: determine the first overall tensor arrangement from the first overall tensor arrangement and the second overall tensor arrangement, where an overall tensor arrangement is the tensor arrangement of the input tensor of each operator and the tensor arrangement of the output tensor of each operator in the deep neural network model, and the sum of the communication time and the calculation time required to perform the training of the deep neural network model based on the first overall tensor arrangement is less than the sum of the communication time and the calculation time required to perform the training of the deep neural network model based on the second overall tensor arrangement.
  • the cost model of the first overall tensor arrangement is smaller than the cost model of the second overall tensor arrangement; the cost model of the first overall tensor arrangement is the value obtained by the weighted summation of the size of the data tensor, the size of the communication tensor, and the size of the parameter tensor in the first overall tensor arrangement with the weighting coefficient of the size of the data tensor, the weighting coefficient of the size of the communication tensor, and the weighting coefficient of the size of the parameter tensor;
  • the cost model of the second overall tensor arrangement is the value obtained by the weighted summation of the size of the data tensor, the size of the communication tensor, and the size of the parameter tensor in the second overall tensor arrangement with the corresponding weighting coefficients.
  • the segmentation strategy is a segmentation strategy specified by the user.
  • the input tensor of each operator in the deep neural network model includes a training data set
  • the training data set includes a text data set, an image data set, or an audio data set.
  • FIG. 14 is a schematic diagram of another embodiment of the data processing device in the embodiment of this application.
  • the data processing device provided in this embodiment may be a processor or a server or a dedicated data processing device, etc.
  • the specific device form is not limited in the embodiment of the present application.
  • the data processing device 1400 may have relatively large differences due to different configurations or performances, and may include one or more processors 1401 and a memory 1402, and the memory 1402 stores programs or data.
  • the memory 1402 may be volatile storage or non-volatile storage.
  • the processor 1401 is one or more central processing units (CPUs), graphics processing units (GPUs), or other special-purpose processors, such as Ascend processors; a CPU can be a single-core CPU or a multi-core CPU.
  • the processor 1401 may communicate with the memory 1402, and execute a series of instructions in the memory 1402 on the data processing device 1400.
  • the data processing device 1400 also includes one or more wired or wireless network interfaces 1403, such as an Ethernet interface.
  • the data processing device 1400 may also include one or more power supplies and one or more input and output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch screen device, a transmission device, or the like; the input and output interfaces are optional components, which may or may not exist, and are not limited here.
  • FIG. 15 is a hardware structure diagram of a chip provided by an embodiment of this application.
  • the algorithm of the deep neural network involved in the embodiment of the present application can be executed in the NPU chip shown in FIG. 15.
  • the neural network processor (NPU) 50 is mounted as a coprocessor to the host CPU (Host CPU), and the Host CPU assigns tasks.
  • the core part of the NPU is the arithmetic circuit 503.
  • the arithmetic circuit 503 is controlled by the controller 504 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 503 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit takes the data of matrix A from the input memory 501, performs matrix operations with matrix B, and stores the partial or final result of the obtained matrix in the accumulator 508.
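  • This is not the chip's microcode, only a schematic of the dataflow just described: matrix B is cached on the PEs, the data of matrix A streams through, and partial sums accumulate:

    import numpy as np

    def pe_array_matmul(A, B):
        acc = np.zeros((A.shape[0], B.shape[1]))  # plays the role of accumulator 508
        for k in range(A.shape[1]):               # stream A one column at a time
            acc += np.outer(A[:, k], B[k, :])     # each PE holds one element of B
        return acc

    A = np.random.rand(4, 3); B = np.random.rand(3, 5)
    assert np.allclose(pe_array_matmul(A, B), A @ B)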
  • the unified memory 506 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 502 through the storage unit access controller 505 (direct memory access controller, DMAC).
  • the input data is also transferred to the unified memory 506 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 510, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 509.
  • the bus interface unit 510 (bus interface unit, BIU for short) is used for the instruction fetch memory 509 to obtain instructions from an external memory, and is also used for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 or to transfer the weight data to the weight memory 502 or to transfer the input data to the input memory 501.
  • The vector calculation unit 507 may include multiple arithmetic processing units and, if necessary, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponentiation, logarithm, size comparison, and so on. It mainly serves non-convolution/fully-connected layer computation in the neural network, such as pooling, batch normalization, and local response normalization.
  • The vector calculation unit 507 can store the processed output vector in the unified buffer 506.
  • The vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values.
  • In some implementations, the vector calculation unit 507 generates normalized values, merged values, or both.
  • The processed output vector can be used as activation input to the arithmetic circuit 503, for example for use in a subsequent layer of the neural network.
  • The instruction fetch buffer 509 connected to the controller 504 is used to store the instructions used by the controller 504.
  • The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories.
  • The external memory is private to this NPU hardware architecture.
  • Each layer of the deep neural network, that is, each operator in the embodiments of this application, may be executed by the matrix calculation unit or the vector calculation unit 507.
  • The foregoing method embodiments of this application may be applied in a processor, or the steps of the foregoing method embodiments may be implemented by a processor.
  • The processor may be an integrated circuit chip with signal-processing capability.
  • During implementation, the steps of the foregoing method embodiments may be completed by integrated hardware logic circuits in the processor or by instructions in the form of software.
  • The foregoing processor may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • The steps of the methods disclosed in this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
  • The apparatus may include multiple processors, or a processor may include multiple processing units.
  • The processor may be a single-core (single-CPU) or a multi-core (multi-CPU) processor.
  • The memory is used to store the computer instructions executed by the processor.
  • The memory can be a storage circuit or a memory.
  • The memory may be volatile or non-volatile, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory.
  • The volatile memory may be random access memory (RAM), which is used as an external cache.
  • The memory may be independent of the processor, or may be a storage unit within the processor, which is not limited here. Although only one memory is shown in the figure, the apparatus may also include multiple memories, or the memory may include multiple storage units.
  • The transceiver is used to implement content interaction between the processor and other units or network elements.
  • The transceiver may be the communication interface of the device, a transceiver circuit or a communication unit, or a transceiver device.
  • The transceiver may also be a communication interface or a transceiver circuit of the processor.
  • In one possible implementation, the transceiver may be a transceiver chip.
  • The transceiver may also include a sending unit and/or a receiving unit.
  • The transceiver may include at least one communication interface.
  • The transceiver may also be a unit implemented in the form of software.
  • The processor may interact with other units or network elements through the transceiver. For example, the processor obtains or receives content from other network elements through the transceiver. If the processor and the transceiver are two physically separate components, the processor can interact with other units of the device without going through the transceiver.
  • The processor, the memory, and the transceiver may be connected to each other through a bus.
  • The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on.
  • In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" should not be construed as preferable to or more advantageous than other embodiments or designs. Rather, such words are intended to present a related concept in a concrete manner.
  • The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or a data center integrating one or more available media.
  • The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).


Abstract

The embodiments of this application disclose a data processing method in the field of artificial intelligence, applied to distributed parallel model training, for example distributed training of text-translation, speech-recognition, face-recognition, 3-D-reconstruction, and virtual-reality models; the method supports the realization of hybrid parallelism on a distributed cluster. The method includes: based on the tensor layout of each operator's tensors in the deep neural network model, inserting redistribution operators between operators with input-output dependencies to convert between different tensor layouts; inserting the redistribution operators into the sliced computation graph; and determining the updated sliced computation graph to realize parallel training of the deep neural network.

Description

Data processing method and data processing device
This application claims priority to Chinese Patent Application No. 202010231450.7, filed with the China National Intellectual Property Administration on March 27, 2020 and entitled "Data processing method and data processing device", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of artificial intelligence, and in particular to a data processing method and a data processing device.
Background
Deploying a deep learning model on multiple computing devices is one way to train large and complex models. Data parallelism is the most widely used parallel strategy, but as datasets and models keep growing, the memory of a single card becomes a bottleneck, the number of training devices keeps increasing, and communication overhead grows; data parallelism hits a wall, and hybrid data-model parallelism becomes necessary.
The parallelization scheme of a deep learning model can be expressed through the tensor layouts of all operators in the model, where a tensor layout includes a device matrix, a tensor shape, and a tensor map. In the existing Mesh-TensorFlow scheme, any dimension of a tensor can be sliced, but the device matrices of all tensors must be identical.
Because Mesh-TensorFlow requires an identical device matrix for every tensor, the tensor maps are constrained: for example, the sample dimension of a tensor must map to the same dimension of the shared device matrix of all tensors. This restricts conversions between multiple parallel modes; for example, hybrid parallelism mixing data parallelism and model parallelism cannot be realized.
Summary
The embodiments of this application provide a data processing method applied to a distributed cluster for training a deep neural network model, so that hybrid parallelism composed of different parallel modes can be realized on the distributed cluster.
A first aspect of the embodiments of this application provides a data processing method. The method is typically applied to a distributed cluster that includes multiple data processing devices, and can be used on one or more of them. The method includes: obtaining a deep neural network model, and the tensor layout of each operator's input tensors and the tensor layout of each operator's output tensors in the model, where a tensor layout includes a device matrix, a tensor map, and a tensor shape; each element of the device matrix represents one data processing device in the distributed cluster, and the multiple devices corresponding to all elements of the device matrix jointly execute the deep neural network model in parallel. The model includes a first operator and a second operator, which are two consecutive operators in the model, the output tensor of the first operator being the input tensor of the second operator, and a first tensor layout being inconsistent with a second tensor layout, where the first tensor layout is the layout of the first operator's output tensor and the second tensor layout is the layout of the second operator's input tensor. The method further includes: determining the sliced computation graph of a data processing device according to the tensor layouts of each operator's input tensors and output tensors; determining a redistribution operator between the first operator and the second operator, the redistribution operator being used to convert the first tensor layout into the second tensor layout; and inserting the redistribution operator into the sliced computation graph to determine an updated sliced computation graph, where the updated sliced computation graph indicates the part of the deep neural network model to be executed.
With the data processing method provided in the embodiments of this application, the data processing device obtains the deep neural network model and the tensor layouts of all its operators, including input-tensor and output-tensor layouts. If two consecutive operators exist such that the layout of the first operator's output tensor disagrees with the layout of the second operator's input tensor, the sliced computation graph derived from those layouts cannot be executed. In this application, the data processing device determines a redistribution operator between the first and second operators that converts the first operator's output-tensor layout into the second operator's input-tensor layout; by inserting the redistribution operator into the sliced computation graph, the determined updated sliced computation graph becomes executable. The scheme applies to arbitrary tensor layouts of every operator in the model, and converts between parallel modes through redistribution operators, so that all kinds of hybrid parallelism can be realized on a distributed cluster.
In a possible implementation of the first aspect, the device matrix of the first tensor layout is inconsistent with the device matrix of the second tensor layout, and/or the tensor map of the first tensor layout is inconsistent with that of the second. Determining the redistribution operator between the first and second operators includes: determining an intermediate tensor layout according to the first and second tensor layouts; determining, according to the first tensor layout and the intermediate tensor layout, a first reshape operator and the layouts of its input and output tensors, the first reshape operator converting the first tensor layout into the intermediate tensor layout; and/or determining, according to the second tensor layout and the intermediate tensor layout, a second reshape operator and the layouts of its input and output tensors, the second reshape operator being located between the first reshape operator and the second operator and converting the intermediate tensor layout into the second tensor layout.
This implementation specifies how to determine the redistribution operator in the scenario where the tensor shapes of the two layouts agree but at least one of the device matrix and tensor map differs: at least one reshape operator is generated via the intermediate tensor layout, and inserting the reshape operator(s) between the two operators converts the first layout into the second. Updating the sliced computation graph with the generated reshape operators reduces the user's workload and improves the efficiency of parallel model training.
In a possible implementation of the first aspect, the device matrices of the first and second tensor layouts are inconsistent. Determining the intermediate tensor layout according to the first and second tensor layouts includes: determining an expanded device matrix according to the two device matrices, where the product of the elements of the expanded device matrix equals the product of the elements of the device matrix of the first operator's output tensor and equals the product of the elements of the second device matrix of the second operator's input tensor, and any element of either original device matrix equals one element of the expanded device matrix or the product of at least two of its elements; determining, according to the expanded device matrix, a first equivalent tensor layout equivalent to the first tensor layout and a second equivalent tensor layout equivalent to the second tensor layout, the device matrices of the two equivalent layouts being identical; and, when the tensor shapes of the first and second equivalent layouts agree, the intermediate tensor layout includes the first and second equivalent layouts, where the tensor shape is the number of elements in each dimension of the tensor.
This implementation describes how to determine the reshape operators when the device matrices differ: the expanded device matrix is determined, the first equivalent layout of the first tensor layout and the second equivalent layout of the second tensor layout (with identical device matrices) are found, and reshape operators are generated from the equivalent layouts. Automatically generating reshape operators to update the sliced computation graph reduces the user's workload in designing layout conversions and improves parallel-training efficiency.
In a possible implementation of the first aspect, the tensor shapes of the first and second equivalent tensor layouts agree while their tensor maps differ. The method further includes: determining one or more tensor-map conversion operators, a tensor-map conversion operator including a slice operator, a concat operator, or a communication operator, where the input-tensor layout of the one or more conversion operators matches the first equivalent layout, the output-tensor layout matches the second equivalent layout, and the one or more conversion operators are used to determine the updated sliced computation graph.
This covers the case where, after device-matrix normalization, the tensor shapes agree but the tensor maps differ: the data processing device determines a first tensor-map conversion operator sequence of one or more operators that converts the tensor map of the first equivalent layout into that of the second. Updating the sliced computation graph with the generated operators reduces the user's workload in designing layout conversions and improves parallel-training efficiency.
In a possible implementation of the first aspect, when the tensor shapes of the first and second equivalent tensor layouts differ, the method further includes: performing tensor-shape normalization according to the first and second equivalent layouts, determining a third equivalent tensor layout equivalent to the first equivalent layout and a fourth equivalent tensor layout equivalent to the second equivalent layout, where the device matrices of the third and fourth equivalent layouts are identical and their tensor shapes are identical; the intermediate tensor layout includes the third and fourth equivalent layouts.
This covers the scenario where, after device-matrix normalization of the first and second layouts, the tensor shapes still differ: tensor-shape normalization is performed, determining the third equivalent layout (equivalent to the first equivalent layout) and the fourth (equivalent to the second), which are used to determine the reshape operators. Generating reshape operators to update the sliced computation graph reduces the user's workload and improves parallel-training efficiency.
In a possible implementation of the first aspect, the tensor maps of the third and fourth equivalent layouts differ. The method further includes: determining one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the third equivalent layout and whose output-tensor layout matches the fourth equivalent layout, the one or more conversion operators being used to determine the updated sliced computation graph.
That is, when after device-matrix and tensor-shape normalization the tensor maps still differ, a second tensor-map conversion operator sequence of one or more operators must be determined to convert the third equivalent layout into the fourth. Updating the sliced computation graph with the generated operators reduces the user's workload and improves parallel-training efficiency.
In a possible implementation of the first aspect, the device matrices of the first and second tensor layouts agree while their tensor shapes differ. Determining the intermediate tensor layout according to the first and second layouts includes: determining, according to the first and second tensor layouts, a fifth equivalent tensor layout equivalent to the first tensor layout and a sixth equivalent tensor layout equivalent to the second tensor layout, where the device matrices of the fifth and sixth equivalent layouts are identical and their tensor shapes are identical; the intermediate tensor layout includes the fifth and sixth equivalent layouts.
This covers the scenario where the device matrices agree and the tensor shapes differ: tensor-shape normalization requires determining the fifth equivalent layout (equivalent to the first tensor layout) and the sixth (equivalent to the second), which are used to generate reshape operators. Generating reshape operators to update the sliced computation graph reduces the user's workload and improves parallel-training efficiency.
In a possible implementation of the first aspect, the tensor maps of the fifth and sixth equivalent layouts differ. The method further includes: determining one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the fifth equivalent layout and whose output-tensor layout matches the sixth equivalent layout, the one or more conversion operators being used to determine the updated sliced computation graph.
That is, when the device matrices agree, the tensor shapes differ, and after tensor-shape normalization the tensor maps still differ, the data processing device further determines a third tensor-map conversion operator sequence of one or more operators to convert the fifth equivalent layout into the sixth. Updating the sliced computation graph with the generated operators reduces the user's workload and improves parallel-training efficiency.
In a possible implementation of the first aspect, the device matrices of the first and second tensor layouts agree, their tensor shapes agree, and their tensor maps differ. Determining the redistribution operator between the first and second operators includes: determining one or more tensor-map conversion operators (slice, concat, or communication operators) which take the first operator's output tensor as input and output the second operator's input tensor.
That is, when device matrices and shapes agree but the tensor maps differ, the data processing device determines a fourth tensor-map conversion operator sequence of one or more operators that converts the first tensor layout into the second. Since the scheme can generate the one or more conversion operators from the first and second layouts to update the sliced computation graph, it reduces the user's workload in designing layout conversions and improves parallel-training efficiency.
In a possible implementation of the first aspect, obtaining the deep neural network model and the layouts of every operator's input and output tensors includes: obtaining the deep neural network model and a slicing strategy, the slicing strategy including the number of slices of the model's tensors in each dimension; and determining, according to the model and the slicing strategy, the layouts of every operator's input tensors and output tensors.
This gives a concrete way for the data processing device to obtain the tensor layouts: first obtain the slicing strategy, then generate each operator's layout from the model and the strategy; it provides an alternative realization for determining operator layouts and increases the flexibility of the scheme.
In a possible implementation of the first aspect, the slicing strategy includes a first slicing strategy and a second slicing strategy. Determining the layouts of every operator's input and output tensors according to the model and the slicing strategy includes: determining the first overall tensor layout corresponding to the first slicing strategy and the second overall tensor layout corresponding to the second, where the first overall layout is the set of layouts of every operator's input and output tensors determined under the first strategy, and the second overall layout is that determined under the second strategy. The method further includes: determining, between the first and second overall layouts, the first overall layout as the layouts of every operator's input and output tensors in the model, where the sum of the communication time and computation time needed to train the model under the first overall layout is smaller than the sum needed under the second.
That is, the data processing device can obtain at least two slicing strategies and determine the overall tensor layout of each; note that one slicing strategy can correspond to multiple overall layouts. Different overall layouts are compared to determine the first overall layout with the smaller cost, where cost means the sum of communication and computation time for training the model under that overall layout; note that executing training under an overall layout requires determining the redistribution operators to be inserted on top of it. The sliced computation graph is determined from the first overall layout. The scheme can consider the overall layouts of multiple slicing strategies and select the one with the smallest cost for slicing, reducing the cost of training the deep neural network model.
In a possible implementation of the first aspect, the cost model of the first overall tensor layout is smaller than that of the second, where the cost model of the first overall layout is the value obtained as the weighted sum of the sizes of the data tensors, communication tensors, and parameter tensors in the first overall layout with the weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size; the cost model of the second overall layout is the corresponding weighted sum for the second overall layout.
That is, when screening for the first overall layout, the costs of different overall layouts can be compared through a cost model that approximates storage and computation cost by tensor sizes, giving a concrete method of comparison; moreover, the weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size can be set flexibly according to the type of data processing device, increasing the flexibility of the scheme.
In a possible implementation of the first aspect, the slicing strategy is a user-specified slicing strategy.
The data processing method of the embodiments is thus also applicable to the scenario where the slicing strategy is specified by the user.
In a possible implementation of the first aspect, the input tensors of every operator in the model include a training data set, which includes a text data set, an image data set, or an audio data set.
With a training data set of text, images, or audio as input, the method can be used in the distributed training of text-translation, speech-recognition, face-recognition, 3-D-reconstruction, virtual-reality, and similar models. For example, with a text data set the corresponding model can realize automatic translation; with an image data set the corresponding model can realize image recognition, face recognition, or 3-D modeling.
A second aspect of the embodiments of this application provides a data processing device, the device including:
an obtaining unit, configured to obtain a deep neural network model and the tensor layouts of every operator's input and output tensors, a tensor layout including a device matrix, a tensor map, and a tensor shape; the model includes a first operator and a second operator, two consecutive operators in the model, the output tensor of the first operator being the input tensor of the second operator, and a first tensor layout (of the first operator's output tensor) being inconsistent with a second tensor layout (of the second operator's input tensor); a determining unit, configured to determine the sliced computation graph according to the layouts of every operator's input and output tensors; the determining unit being further configured to determine the redistribution operator between the first and second operators, used to convert the first tensor layout into the second; and further configured to insert the redistribution operator into the sliced computation graph to determine the updated sliced computation graph, which indicates the part of the model to be executed.
In a possible implementation of the second aspect, the device matrices of the first and second tensor layouts are inconsistent, and/or their tensor maps are inconsistent;
the determining unit is specifically configured to:
determine an intermediate tensor layout according to the first and second tensor layouts;
determine, according to the first tensor layout and the intermediate layout, a first reshape operator and the layouts of its input and output tensors, the first reshape operator converting the first tensor layout into the intermediate layout; and/or
determine, according to the second tensor layout and the intermediate layout, a second reshape operator and the layouts of its input and output tensors, the second reshape operator being located between the first reshape operator and the second operator and converting the intermediate layout into the second tensor layout.
In a possible implementation of the second aspect, the device matrices of the first and second tensor layouts are inconsistent;
the determining unit is specifically configured to:
determine an expanded device matrix according to the two device matrices, where the product of its elements equals the product of the elements of the first operator's output-tensor device matrix and equals the product of the elements of the second device matrix of the second operator's input tensor, and any element of either device matrix equals one element of the expanded matrix or the product of at least two of its elements;
determine, according to the expanded device matrix, a first equivalent tensor layout equivalent to the first tensor layout and a second equivalent tensor layout equivalent to the second, the device matrices of the two equivalent layouts being identical;
when the tensor shapes of the first and second equivalent layouts agree, the intermediate layout includes the first and second equivalent layouts, the tensor shape being the number of elements in each dimension of the tensor.
In a possible implementation of the second aspect, the tensor shapes of the first and second equivalent layouts agree while their tensor maps differ;
the determining unit is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the first equivalent layout and whose output-tensor layout matches the second equivalent layout, the one or more conversion operators being used to determine the updated sliced computation graph.
In a possible implementation of the second aspect, when the tensor shapes of the first and second equivalent layouts differ, the determining unit is further configured to perform tensor-shape normalization according to the first and second equivalent layouts, determining a third equivalent tensor layout equivalent to the first equivalent layout and a fourth equivalent tensor layout equivalent to the second, the device matrices of the third and fourth equivalent layouts being identical and their tensor shapes identical; the intermediate layout includes the third and fourth equivalent layouts.
In a possible implementation of the second aspect, the tensor maps of the third and fourth equivalent layouts differ;
the determining unit is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the third equivalent layout and whose output-tensor layout matches the fourth, the one or more conversion operators being used to determine the updated sliced computation graph.
In a possible implementation of the second aspect, the device matrices of the first and second tensor layouts agree while their tensor shapes differ;
the determining unit is specifically configured to determine, according to the first and second tensor layouts, a fifth equivalent tensor layout equivalent to the first tensor layout and a sixth equivalent tensor layout equivalent to the second, the device matrices of the fifth and sixth equivalent layouts being identical and their tensor shapes identical; the intermediate layout includes the fifth and sixth equivalent layouts.
In a possible implementation of the second aspect, the tensor maps of the fifth and sixth equivalent layouts differ;
the determining unit is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the fifth equivalent layout and whose output-tensor layout matches the sixth, the one or more conversion operators being used to determine the updated sliced computation graph.
In a possible implementation of the second aspect, the device matrices of the first and second tensor layouts agree, their tensor shapes agree, and their tensor maps differ;
the determining unit is specifically configured to:
determine one or more tensor-map conversion operators (slice, concat, or communication operators) which take the first operator's output tensor as input and output the second operator's input tensor.
In a possible implementation of the second aspect, the obtaining unit is specifically configured to:
obtain the deep neural network model and a slicing strategy, the strategy including the number of slices of the model's tensors in each dimension;
the determining module is specifically configured to determine, according to the model and the slicing strategy, the layouts of every operator's input and output tensors.
In a possible implementation of the second aspect,
the slicing strategy includes a first slicing strategy and a second slicing strategy;
the determining unit is specifically configured to:
determine the first overall tensor layout corresponding to the first strategy and the second overall tensor layout corresponding to the second, each overall layout being the layouts of every operator's input and output tensors determined under that strategy;
and determine, between the first and second overall layouts, the first overall layout as the layouts of every operator's input and output tensors, the sum of communication and computation time needed to train the model under the first overall layout being smaller than the sum needed under the second.
In a possible implementation of the second aspect, the cost model of the first overall layout is smaller than that of the second, the cost model of the first overall layout being the weighted sum of the sizes of its data tensors, communication tensors, and parameter tensors with the respective weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size;
the cost model of the second overall layout is the corresponding weighted sum for the second overall layout.
In a possible implementation of the second aspect, the slicing strategy is a user-specified slicing strategy.
In a possible implementation of the second aspect, the input tensors of every operator include a training data set, which includes a text, image, or audio data set.
A third aspect of the embodiments provides a data processing device including a processor and a memory connected to each other, where the memory stores a computer program comprising program instructions and the processor is configured to invoke the program instructions to perform the method of the first aspect or any of its possible implementations.
A fourth aspect of the embodiments provides a computer program product containing instructions which, when run on a computer, causes the computer to perform the method of the first aspect or any of its possible implementations.
A fifth aspect of the embodiments provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any of its possible implementations.
A sixth aspect of the embodiments provides a chip including a processor. The processor is configured to read and execute a computer program stored in a memory to perform the method in any possible implementation of any of the above aspects. Optionally, the chip includes the memory, which is connected to the processor by a circuit or wires. Further optionally, the chip includes a communication interface connected to the processor; the communication interface receives data and/or information to be processed, and the processor obtains the data and/or information from it, processes it, and outputs the processing result through it. The communication interface may be an input/output interface.
A seventh aspect of the embodiments provides a distributed cluster including one or more data processing devices of the second aspect or any of its possible implementations.
For the technical effects of any implementation of the second to seventh aspects, refer to the technical effects of the corresponding implementation of the first aspect; they are not repeated here.
With the data processing method provided in the embodiments of this application, the data processing device obtains the deep neural network model and the tensor layouts of all its operators, including input-tensor and output-tensor layouts. If two consecutive operators exist such that the layout of the first operator's output tensor disagrees with the layout of the second operator's input tensor, the sliced computation graph derived from those layouts cannot be executed. The device determines a redistribution operator between the two operators that converts the former layout into the latter, and after inserting it the determined updated sliced computation graph can be executed. The scheme applies to arbitrary per-operator tensor layouts and converts between parallel modes through redistribution operators, so that all kinds of hybrid parallelism can be realized on a distributed cluster.
Brief description of drawings
FIG. 1 is a schematic diagram of an artificial intelligence main framework according to an embodiment of this application;
FIG. 2 is a schematic diagram of an application environment according to an embodiment of this application;
FIG. 3 is a schematic diagram of a computation graph according to an embodiment of this application;
FIG. 4 is a schematic diagram of a distributed-cluster topology according to an embodiment of this application;
FIG. 5 is a schematic diagram of an application scenario in the embodiments of this application;
FIG. 6 is a schematic diagram of an embodiment of the data processing method in the embodiments of this application;
FIG. 7 is a schematic diagram of an embodiment of tensor layout in the embodiments of this application;
FIG. 8 is a schematic diagram of another embodiment of the data processing method in the embodiments of this application;
FIG. 9 is a schematic diagram of an embodiment of generating a redistribution operator in the embodiments of this application;
FIG. 10 is a schematic diagram of another embodiment of generating a redistribution operator in the embodiments of this application;
FIG. 11 is a schematic diagram of an embodiment of determining the overall tensor layout in the embodiments of this application;
FIG. 12 is a schematic diagram of another embodiment of determining the overall tensor layout in the embodiments of this application;
FIG. 13 is a schematic diagram of an embodiment of the data processing device in the embodiments of this application;
FIG. 14 is a schematic diagram of another embodiment of the data processing device in the embodiments of this application;
FIG. 15 is a diagram of a chip hardware structure according to an embodiment of this application.
Detailed description of embodiments
The embodiments of this application provide a data processing method applied to a distributed cluster for training a deep neural network model; based on the tensor layout of each operator, the training process of the model can be parallelized.
For ease of understanding, some technical terms involved in the embodiments are briefly introduced below:
1. Deep neural network model, also called model, network, or algorithm in the embodiments of this application, divided into a forward part and a backward part. Forward propagation, also called the forward computation part, is the model's computation process: it produces the corresponding output for a given input. Backward propagation, also called the backward computation part, trains the model parameters by gradient descent over all parameters, minimizing the model's loss function on the training data.
2. Computation graph, also called dataflow graph: every computation in the neural network is a node of the graph, and the edges between nodes represent input-output dependencies between data. In a sliced computation graph, compared to the full computation graph, the original nodes are unchanged but the data carried by the edges is a portion of the full data; in addition, in the embodiments of this application, nodes used for redistribution may be added to the sliced computation graph.
3. Tensor: an n-dimensional array, the n-dimensional generalization of scalars, 1-D vectors, and 2-D matrices. In model training for machine learning, training data and intermediate computation results can all be regarded as tensors.
4. Operator: an operation on the attributes of tensors, such as matrix multiplication, tensor addition, convolution. In this document an operator is equivalent to a layer of a neural network.
5. Tensor shape: the one-dimensional array formed by the number of elements in each dimension of a tensor.
6. Device matrix: a one-dimensional array expressing how the devices are arranged. The number of its elements is the number of device-arrangement dimensions, and the product of its elements equals the total number of devices.
7. Tensor map: a one-dimensional array with as many elements as the tensor shape; each value indicates onto which device-matrix dimension the slicing of the corresponding tensor dimension is mapped.
8. Tensor layout: the arrangement of a distributed tensor across the devices, expressed jointly by the tensor shape, device matrix, and tensor map.
9. Distributed tensor layout (tensor layout): the way the elements of a distributed tensor are arranged on the devices; also abbreviated as tensor layout in the embodiments of this application.
Tensor shape: the full shape, i.e., the number of elements in each dimension.
Device matrix: each operator is assigned its own device matrix.
Tensor map: a vector with the same number of dimensions as the tensor shape.
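To make the layout triple concrete, the following is a minimal Python sketch (illustrative only, not part of the patent; the class name and helper are assumptions) of a tensor layout and the per-device slice shape it implies:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TensorLayout:
    """Distributed tensor layout: device matrix + tensor map + tensor shape."""
    device_matrix: List[int]  # e.g. [3, 2, 4]; product = total number of devices
    tensor_map: List[int]     # one entry per tensor dimension; -1 = not sliced
    tensor_shape: List[int]   # full (unsliced) tensor shape

    def device_dim(self, k: int) -> int:
        # Device-matrix dimensions are numbered from the right:
        # for [D2, D1, D0], dimension 0 is the last element.
        return self.device_matrix[len(self.device_matrix) - 1 - k]

    def slice_shape(self) -> List[int]:
        # A dimension mapped to device-matrix dimension k is cut into
        # device_dim(k) equal slices; -1 leaves the dimension whole.
        return [s if m == -1 else s // self.device_dim(m)
                for s, m in zip(self.tensor_shape, self.tensor_map)]
```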
The embodiments of this application are described below with reference to the drawings. Evidently, the described embodiments are only some rather than all embodiments of this application. A person of ordinary skill in the art knows that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The term "and/or" in this application describes an association between associated objects and indicates three possible relationships: for example, A and/or B can mean that A exists alone, A and B both exist, or B exists alone, where A and B may be singular or plural. The character "/" in this application generally indicates an "or" relationship between the associated objects. In this application, "at least one" means one or more, and "multiple" means two or more. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c can represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
The terms "first", "second", and so on in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely the way objects with the same properties are distinguished when described in the embodiments of this application. In addition, the terms "include" and "have" and any of their variants are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device containing a series of units is not necessarily limited to those units and may include other units not clearly listed or inherent to the process, method, product, or device.
FIG. 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an AI system and is applicable to general requirements of the AI field.
The AI main framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the sequence of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process the data undergoes the refinement "data - information - knowledge - wisdom".
The "IT value chain" reflects the value artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology) up to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing-power support for the AI system, enables communication with the outside world, and provides support through a base platform. It communicates with the outside through sensors; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the base platform includes platform assurance and support such as a distributed computing framework and networking, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate externally to obtain data, which is provided to the intelligent chips in the distributed computing system offered by the base platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources of the AI field. The data involves graphics, images, speech, and text, as well as IoT data of traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formalized intelligent-information modeling, extraction, preprocessing, and training on the data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, carrying out machine thinking and problem solving with formalized information according to a reasoning control strategy; typical functions are search and matching.
Decision-making refers to the process of making decisions on intelligent information after reasoning, usually providing functions such as classification, ranking, and prediction.
(4) General capabilities
After the data has undergone the data processing described above, some general capabilities can be formed based on the processing results, for example an algorithm or a general system, such as translation, text analysis, computer-vision processing, speech recognition, image recognition, and so on.
(5) Intelligent products and industry applications
Intelligent products and industry applications are the products and applications of AI systems in various fields; they encapsulate the overall AI solution, productize intelligent-information decision-making, and land in applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe city, intelligent terminals, and the like.
The data processing method provided in the embodiments of this application can be applied to the parallel training of deep neural network models in all kinds of distributed-cluster scenarios: it can determine a slicing strategy independently for each operator, generate the tensor layouts, and obtain, by inserting redistribution operators, the slice graph of the deep neural network model executable by a single data processing device.
Referring to FIG. 2, an embodiment of this application provides a system architecture 200. Optionally, the data collection device 260 collects data and stores it in the database 230, and the training device 220 generates the target model/rule 201 based on the data maintained in the database 230. Optionally, the data may be text data, audio data, or image data, where images include pictures and videos; the specific data type is not limited here. How the training device 220 obtains the target model/rule 201 based on the data is described in more detail below; the target model/rule 201 can be used in application scenarios such as text translation, speech recognition, face recognition, 3-D reconstruction, and virtual reality.
The target model/rule 201 may be obtained based on a deep neural network, which is introduced below.
The work of each layer in a deep neural network can be described by the mathematical expression y = a(W·x + b). At the physical level, the work of each layer can be understood as completing a transformation from the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space: 1. raising/lowering dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by W·x, operation 4 by +b, and operation 5 by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is a weight vector in which each value represents the weight of one neuron of that layer of the network; the vector W determines the spatial transformation from input space to output space described above, i.e., the weight W of each layer controls how space is transformed. The purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because we want the output of the deep neural network to be as close as possible to the value actually desired, we can compare the current predicted value of the network with the desired target value and update the weight vector of each layer according to the difference between them (of course, there is usually an initialization process before the first update, i.e., pre-configuring parameters for every layer of the network). For example, if the network's predicted value is too high, the weight vector is adjusted to predict lower; the adjustment continues until the network can predict the actually desired target value. Hence it is necessary to pre-define "how to compare the difference between the predicted value and the target value": this is the loss function or objective function, an important equation for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the deep neural network becomes the process of shrinking this loss as much as possible.
The target model/rule obtained by the training device 220 can be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 for data interaction with external devices; a "user" can input data to the I/O interface 212 through the client device 240.
The execution device 210 can call data, code, and the like in the data storage system 250, and can also store data, instructions, and the like into the data storage system 250.
The computation module 211 processes the input data using the target model/rule 201. Taking 3-D modeling as an example, the computation module 211 can parse the input image or image sequence and recover the depth information of the target.
The associated function module 213 can preprocess the image data in the computation module 211.
The associated function module 214 can preprocess the image data in the computation module 211.
Finally, the I/O interface 212 returns the processing result to the client device 240 and provides it to the user.
Further, the training device 220 can generate corresponding target models/rules 201 for different targets based on different data, so as to provide better results to the user.
In the case shown in FIG. 2, the user can manually specify the data input to the execution device 210, for example by operating in the interface provided by the I/O interface 212. In another case, the client device 240 can automatically input data to the I/O interface 212 and obtain results; if automatic input by the client device 240 requires the user's authorization, the user can set the corresponding permission in the client device 240. The user can view the results output by the execution device 210 on the client device 240, in concrete forms such as display, sound, or action. The client device 240 can also serve as a data-collection end, storing the collected training data into the database 230.
It is worth noting that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of this application; the positional relationships between the devices, components, and modules shown in the figure impose no limitation. For example, in FIG. 2 the data storage system 250 is external memory relative to the execution device 210, while in other cases the data storage system 250 can also be placed inside the execution device 210. For another example, in FIG. 2 the training device 220, the execution device 210, and the client device 240 are independent devices; in other cases the training device 220 and the execution device 210 can be one physical device implementing all functions of both; optionally, the execution device 210 and the client device 240 can be one physical device implementing all functions of both; optionally, the training device 220, the execution device 210, and the client device 240 are all the same physical device implementing all their functions. The specific scenario architecture of the embodiments of this application is not limited here.
In the existing parallelization scheme for deep learning models, Mesh-TensorFlow, any dimension of any tensor in the model can be sliced, but the device matrices of all tensors must be identical. Because Mesh-TensorFlow requires identical device matrices for all tensors, the tensor maps are constrained: for example, the sample dimension of a tensor must map to the same dimension of the shared device matrix of all tensors. This restricts the conversion between multiple parallel modes; for example, hybrid parallelism of data parallelism and model parallelism cannot be realized, tensors cannot be sliced independently per operator, and the overall communication efficiency of the parallel scheme is low.
Refer to FIG. 3, a schematic diagram of a computation graph provided by an embodiment of this application.
A layer of a deep neural network model can be regarded as an operator, and the model includes multiple operators. FIG. 3 uses a first operator and a second operator to illustrate part of a deep neural network computation graph: the input tensors of the first operator include the data tensor X and the parameter tensor W; the output tensor of the first operator is the input tensor of the second operator; the second operator also has one further input tensor, the parameter tensor V, and the output tensor of the second operator is tensor Z. Understandably, an operator's tensors include input and output tensors, the number of an operator's tensors is not limited, and an operator may or may not have input parameter tensors, which is not limited here; FIG. 3 is only one possible computation graph.
FIG. 4 is a schematic diagram of a distributed-cluster topology provided by an embodiment of this application.
A distributed cluster usually includes multiple servers, and each server may include multiple data processing devices; a data processing device may specifically be a CPU, a GPU, or another type of processor, such as an Ascend chip, which is not limited here. FIG. 4 illustrates one possible distributed-cluster topology: the cluster includes n servers, each hosting 8 data processing devices (commonly abbreviated "one machine, eight cards"), and the servers communicate through a switching network. Understandably, the communication latency between servers is greater than the communication latency between data processing devices inside a server. A deep neural network can be deployed on this distributed cluster, with multiple data processing devices across multiple servers training the deep neural network model in parallel.
FIG. 5 is a schematic diagram of an application scenario in the embodiments of this application.
Through Python front-end graph compilation, the deep neural network model can be converted into the full computation graph. This application obtains the full computation graph for single-machine execution and, according to the data processing method provided by the embodiments of this application, generates the parallel scheme and obtains the sliced computation graph; the sliced computation graph is used by a single data processing device to execute its part of the training of the deep neural network model, and goes through graph compilation with automatic differentiation and graph optimization to obtain the execution graph.
The data processing method provided by the embodiments of this application is mainly used to generate the parallel scheme and specifically includes an operator-parallel-slicing modeling module, a cost-model module, a parallel-slicing-strategy module, and a graph-slicing module. The operator-parallel-slicing modeling module can generate multiple candidate slicing strategies and the corresponding operator tensor layouts of the whole deep neural network model; the cost-model module can determine the target slicing strategy among the candidates according to cost; the parallel-slicing-strategy module inserts the redistribution operators; and the graph-slicing module slices the graph according to the deep neural network model updated with the inserted redistribution operators and the tensor layout of each operator, thereby obtaining the sliced computation graph.
Refer to FIG. 6, a schematic diagram of an embodiment of the data processing method in the embodiments of this application.
601. The data processing device obtains the deep neural network model and a slicing strategy.
The data processing device obtains the deep neural network model (also called model, network, or algorithm in the embodiments of this application) and, from the obtained model, determines the full forward graph of the model as executed by a single machine.
The data processing device obtains a slicing strategy, i.e., the slicing manner of every tensor in the model, including the number of slices of the tensor in each dimension. Note that the slicing strategy may be specified by the user or generated by the data processing device, which is not limited here. Optionally, the data processing device generates multiple candidate slicing strategies according to the model and the device topology information and determines the optimal strategy among the candidates; for the specific realization, refer to the later embodiments.
Note that obtaining a slicing strategy in step 601 is optional: if the data processing device generates multiple candidate strategies according to the model and the topology information, obtains multiple overall tensor layouts from them, and determines the target overall layout among these, it can proceed directly to step 603.
602. The data processing device determines the tensor layout of each operator according to the deep neural network model and the slicing strategy.
A distributed tensor layout is the way the elements of a distributed tensor are arranged across the devices; it consists of a device matrix, a tensor map, and a tensor shape. In this and later embodiments, the distributed tensor layout is abbreviated as tensor layout. The tensor shape is the one-dimensional array formed by the number of elements in each dimension of the tensor. The device matrix expresses the device arrangement: a one-dimensional array whose element count is the number of arrangement dimensions and whose element product equals the total number of devices. The tensor map is a one-dimensional array with as many elements as the tensor shape; each value indicates onto which device-matrix dimension the slicing of the corresponding tensor dimension is mapped. The tensor-layout expression can support the layout needs of operators in all kinds of parallel modes.
Once the data processing device has determined the tensor layout of each operator, it can obtain its own tensor slices and derive the sliced computation graph.
From the full forward graph of the model and the slicing strategy, the device can determine each operator's tensor layout. Specifically: from the model it determines all operators' tensors, including the element count per dimension, i.e., the tensor shape; from the slicing strategy (the slice count of each tensor in each dimension) and the device topology of the distributed cluster (including the number of data processing devices per server, the connections between servers, and so on), it can determine the device matrix and tensor map of each operator's tensors.
Optionally, if in step 601 the slicing strategy was user-specified, the device may generate each operator's tensor layout from the user-specified strategy according to preset rules, or generate multiple overall tensor layouts from the user-specified strategy, where an overall tensor layout is the set of tensor layouts of every operator's tensors in the model, and then determine the optimal overall layout among them.
Optionally, if in step 601 the slicing strategy was determined by the device among multiple candidates, the device can determine the optimal overall tensor layout corresponding to that strategy; for details, refer to the later embodiments.
Illustratively, refer to FIG. 7, a schematic diagram of an embodiment of tensor layout in the embodiments of this application.
The input tensors of the matrix multiplication operator (matmul) are tensor A and tensor B, and the output tensor is tensor C.
Table 1 in FIG. 7 shows the tensor layouts of tensor A, tensor B, and tensor C. It can be seen that the device matrices of all three tensors are [3, 2, 4]. Tensor A has tensor shape [h_A, w_A] and tensor map [2, 1], meaning that dimension 0 of tensor A (h_A) maps to device-matrix dimension 2 and dimension 1 (w_A) maps to device-matrix dimension 1. Similarly, tensor B has tensor shape [h_B, w_B] with tensor map [1, 0], and tensor C has tensor shape [h_C, w_C] with tensor map [2, 0].
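As a hedged illustration of Table 1, plugging tensor A's layout into the `TensorLayout` sketch introduced with the terminology above (with made-up sizes h_A = 384 and w_A = 1024) yields the slice held by each device:

```python
# tensor A from Table 1: device matrix [3, 2, 4], tensor map [2, 1]
a = TensorLayout(device_matrix=[3, 2, 4], tensor_map=[2, 1],
                 tensor_shape=[384, 1024])
print(a.slice_shape())  # [128, 512]: h_A is cut 3 ways (dim 2), w_A 2 ways (dim 1)
```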
Note that steps 601 to 602 are optional: before step 603 the data processing device may directly obtain the deep neural network model together with the tensor layouts of every operator's input and output tensors. From the model and these layouts, the device can obtain its tensor slices and derive the sliced computation graph.
603. The data processing device inserts redistribution operators between consecutive operators with different tensor layouts.
Suppose the first operator and the second operator are two consecutive operators in the full forward graph of the model, and the layout of the first operator's output tensor (tensor_layout_from) is inconsistent with the layout of the second operator's input tensor (tensor_layout_to). The data processing device then cannot execute the tensor slices determined in step 602, so in this embodiment a redistribution operator is inserted between the first and second operators.
Two tensor layouts are inconsistent when at least one of the device matrix, tensor map, and tensor shape differs.
If the device matrices of tensor_layout_from and tensor_layout_to differ, the device performs device-matrix normalization: it determines a first equivalent layout of tensor_layout_from (tensor_layout_from2) and a second equivalent layout of tensor_layout_to (tensor_layout_to2) such that the device arrangements of tensor_layout_from2 and tensor_layout_to2 agree. Conversion between equivalent layouts can be realized by reshape operators; for example, tensor_layout_from is converted into tensor_layout_from2 by a reshape operator.
If the device matrices of tensor_layout_from and tensor_layout_to agree but the tensor shapes differ, or if the tensor shapes of tensor_layout_from2 and tensor_layout_to2 differ, the device performs tensor-shape normalization: it determines a third equivalent layout of tensor_layout_from (tensor_layout_from3) and a fourth equivalent layout of tensor_layout_to (tensor_layout_to3) whose tensor shapes agree. Conversion between equivalent layouts can again be realized by reshape operators; for example, tensor_layout_from2 is converted into tensor_layout_from3 by a reshape operator.
If the device matrices agree, the tensor shapes agree, and the tensor maps differ (whether for the original layouts, for the *2 layouts, or for the *3 layouts), tensor-map conversion operators are inserted between the tensors with differing maps: slice operators, concat operators, or communication operators. Communication operators include the alltoall operator, the allgather operator, and the like. Through tensor-map conversion operators, conversion between tensors with different tensor maps can be realized, for example the conversion from tensor_layout_from3 to tensor_layout_to3. Understandably, when a communication or slice operator is generated, the operator's tensor layouts can be obtained as well.
Depending on how tensor_layout_from and tensor_layout_to disagree, the data processing device will generate one or more of the reshape, slice, concat, alltoall, and allgather operators; the redistribution operator may therefore consist of a single operator or a sequence of multiple operators, and the specific number and types of operators are not limited here. Through the redistribution operators, the conversion from tensor_layout_from to tensor_layout_to can be realized.
For the specific computation that generates the redistribution operators, refer to the later embodiments.
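The case analysis of this step can be read as a small dispatch routine. The sketch below only mirrors the order of the normalizations described above; the three helper functions are assumptions standing in for the procedures of the later embodiment, not functions defined in the patent:

```python
def make_redistribution(src: TensorLayout, dst: TensorLayout) -> list:
    """Sketch: build the operator sequence converting layout src into dst."""
    ops = []
    if src.device_matrix != dst.device_matrix:
        # Device-matrix normalization; may emit reshape operators.
        src, dst, more = device_matrix_normalize(src, dst)
        ops += more
    if src.tensor_shape != dst.tensor_shape:
        # Tensor-shape normalization; may emit reshape operators.
        src, dst, more = tensor_shape_normalize(src, dst)
        ops += more
    if src.tensor_map != dst.tensor_map:
        # Emits slice / concat / alltoall / allgather operators.
        ops += infer_tensor_map_ops(src.tensor_map, dst.tensor_map)
    return ops
```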
Note that the deep neural network model may contain multiple pairs of first and second operators; inserting a redistribution operator between each such pair determines an executable, updated deep neural network model.
Optionally, if in step 601 the slicing strategy was determined by the device among multiple candidates, the device may directly obtain the redistribution operators that need to be inserted between the first and second operators; in this step the redistribution operators are simply inserted between them in sequence order.
604. The data processing device updates the sliced computation graph according to the redistribution operators and determines the updated sliced computation graph.
The data processing device updates the sliced computation graph according to the redistribution operators, and then, following the graph-compilation flow, generates the sliced execution graph through automatic differentiation, graph optimization, and other processes; a single data processing device executes one part of the deep-neural-network training process based on its corresponding sliced execution graph.
Similarly, every data processing device in the distributed cluster can obtain its own sliced computation graph and generate an execution graph from it, executing one part of the training process of the deep neural network model.
With the data processing method provided in the embodiments of this application, the data processing device obtains the deep neural network model and the tensor layouts of all its operators, including input-tensor and output-tensor layouts. If two consecutive operators exist whose adjacent layouts disagree, the sliced computation graph derived from those layouts cannot be executed; in this application the device determines a redistribution operator between the first and second operators that converts the first operator's output-tensor layout into the second operator's input-tensor layout, so that after inserting the redistribution operator, the determined updated sliced computation graph can be executed. The scheme applies to any given operator tensor layouts and converts between parallel modes through redistribution operators, so that all kinds of hybrid parallelism can be realized on a distributed cluster.
The data processing method of the embodiments supports flexible tensor-slicing parallelism: each operator is modeled independently and can be sliced along all tensor dimensions, and operators convert tensor layouts between each other through redistribution operator sequences. The existing Mesh-TensorFlow method models over the whole deep neural network and cannot support conversion of parallel modes between operators.
Illustratively, hybrid parallel mode 1: converting channel model parallelism to batch data parallelism is a common parallel-training pattern for the deep learning recommendation model (DLRM). For this hybrid parallelism, Mesh-TensorFlow cannot switch parallel modes freely between operators, whereas this application, through the redistribution-operator generation apparatus, can generate the redistribution operators needed for any tensor-layout conversion, supporting flexible operator-level slicing-strategy configuration.
Illustratively, hybrid parallel mode 2: data parallelism superimposed on model parallelism is a common parallel pattern for Transformer networks. Existing schemes cannot slice all dimensions of a tensor and therefore cannot support this slicing scenario; this application can support slicing in every dimension of a tensor.
Any specified parallel strategy, for example converting data parallelism to model parallelism, is supported by the scheme provided in this application.
Refer to FIG. 8, a schematic diagram of another embodiment of the data processing method in the embodiments of this application.
FIG. 8 shows the overall flow of the data processing method: the network model goes through the cost model and the parallel-strategy search, where the parallel strategy may mix data parallelism and model parallelism. After multiple sliced computation graphs are obtained from the parallel strategy, they are executed over the communication network by multiple data processing devices of the distributed cluster.
Refer to FIG. 9, a schematic diagram of an embodiment of generating a redistribution operator in the embodiments of this application.
Suppose the first and second operators are two consecutive operators in the full forward graph of the model, and the layout of the first operator's output tensor (tensor_layout_from) disagrees with the layout of the second operator's input tensor (tensor_layout_to); the model then cannot be sliced and executed according to the tensor layouts determined in step 602. In this embodiment, redistribution operators are inserted between the first and second operators to update the sliced computation graph so that the updated graph is executable.
Two tensor layouts are inconsistent when at least one of the device matrix, tensor map, and tensor shape differs.
If the device matrices of tensor_layout_from and tensor_layout_to differ, perform steps 901 to 905.
If the device matrices agree and the tensor shapes differ, perform steps 902 to 905.
If the device matrices agree, the tensor shapes agree, and the tensor maps differ, perform steps 903 to 905.
If the device matrices, tensor maps, and tensor shapes all agree, no redistribution operator needs to be inserted between the first and second operators.
901. The data processing device determines the equivalent layout tensor_layout_from2 of tensor_layout_from under the expanded device matrix, and the equivalent layout tensor_layout_to2 of tensor_layout_to under the expanded device matrix.
From the device matrices of tensor_layout_from and tensor_layout_to, the device determines the first equivalent tensor layout tensor_layout_from2 (equivalent to tensor_layout_from) and the second equivalent tensor layout tensor_layout_to2 (equivalent to tensor_layout_to).
Optionally, one implementation of determining tensor_layout_from2 and tensor_layout_to2 is as follows:
1) Since the device matrices of tensor_layout_from and tensor_layout_to differ, the device computes, from the first device matrix device_matrix_from = [A_0, A_1, ..., A_n] and the second device matrix device_matrix_to = [B_0, B_1, ..., B_m], the accumulated device matrices of the first and second device matrices. A_0, A_1, ..., A_n are the elements of the first device matrix, with n a positive integer greater than or equal to 1; B_0, B_1, ..., B_m are the elements of the second device matrix, with m a positive integer greater than or equal to 1.
The accumulated device matrix of the first device matrix [A_0, A_1, ..., A_n] is [A_0·A_1···A_n, A_1···A_n, ..., A_n];
the accumulated device matrix of the second device matrix [B_0, B_1, ..., B_m] is [B_0·B_1···B_m, B_1···B_m, ..., B_m].
Illustratively, for device_matrix_from = [2, 16] and device_matrix_to = [4, 4, 2]: device_matrix_from_accum = [2*16, 16] = [32, 16], and device_matrix_to_accum = [4*4*2, 4*2, 2] = [32, 8, 2].
2) Determine the minimal accumulated normalized expanded device matrix device_matrix_equal2_accum from the two accumulated device matrices.
Taking the union of the two accumulated device matrices gives device_matrix_equal2_accum, the minimal accumulated normalized expanded device matrix [C_0·C_1···C_k, C_1···C_k, ..., C_k], where k is a positive integer greater than or equal to 1; the number of elements, k+1, equals the number of elements remaining after removing duplicates from the union of the two accumulated device matrices.
Illustratively, from device_matrix_from_accum = [32, 16] and device_matrix_to_accum = [32, 8, 2], the union gives device_matrix_equal2_accum = [32, 16, 8, 2].
3) Determine the minimal normalized expanded device matrix from the minimal accumulated normalized expanded device matrix.
From the minimal accumulated normalized expanded device matrix [C_0·C_1···C_k, C_1···C_k, ..., C_k], the minimal normalized expanded device matrix [C_0, C_1, ..., C_k] is obtained.
Illustratively, from device_matrix_equal2_accum = [32, 16, 8, 2], the minimal normalized expanded device matrix device_matrix_equal2 = [2, 2, 4, 2] is obtained.
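Steps 1) to 3) are mechanical; the following Python sketch (function names are illustrative, not from the patent) reproduces the accumulated-matrix computation and recovers the minimal normalized expanded device matrix of the example:

```python
def accumulated(dm):
    # [2, 16] -> [32, 16]: running products taken from the right.
    out, prod = [], 1
    for d in reversed(dm):
        prod *= d
        out.append(prod)
    return out[::-1]

def min_expanded_device_matrix(dm_from, dm_to):
    # Union of both accumulated matrices, sorted descending, then turned
    # back into a device matrix by taking ratios of consecutive entries.
    acc = sorted(set(accumulated(dm_from)) | set(accumulated(dm_to)),
                 reverse=True)
    acc.append(1)
    return [acc[i] // acc[i + 1] for i in range(len(acc) - 1)]

print(accumulated([2, 16]))                            # [32, 16]
print(accumulated([4, 4, 2]))                          # [32, 8, 2]
print(min_expanded_device_matrix([2, 16], [4, 4, 2]))  # [2, 2, 4, 2]
```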
4) Determine the first equivalent tensor layout (tensor_layout_from2) of the first tensor and the second equivalent tensor layout (tensor_layout_to2) of the second tensor from the minimal normalized expanded device matrix.
Specifically, determining the equivalent layout of the first tensor layout (tensor_layout_from) from the minimal normalized expanded device matrix proceeds as follows.
Suppose tensor_layout_from is:
tensor_shape = [s[N-1], s[N-2], ..., s[0]], where s[N-1], ..., s[0] are the elements of the tensor shape;
device_matrix = [d[D-1], d[D-2], ..., d[0]], where d[D-1], ..., d[0] are the elements of the device matrix;
tensor_map = [m[N-1], m[N-2], ..., m[0]], where m[N-1], ..., m[0] are the elements of the tensor map.
Suppose d[i] = m·n in device_matrix is expanded; the expanded device matrix is device_matrix_e = [d[D-1], d[D-2], ..., d[i+1], m, n, d[i-1], ..., d[0]].
Check whether some m[k] = i exists in tensor_map.
If m[k] = i exists in tensor_map, the tensor map of the equivalent layout gains one element: tensor_map_e = [me[N], me[N-1], ..., me[k+1], me[k], ..., me[0]], where (as reconstructed from the worked examples below) the sliced dimension k splits into two dimensions mapped onto the two new device dimensions,
me[k+1] = i+1, me[k] = i,
and every other map value is shifted to account for the split device dimension:
me[j] = m[j] + 1 if m[j] > i, otherwise me[j] = m[j] (including m[j] = -1).
The tensor shape of the equivalent layout is tensor_shape_e = [se[N-1], se[N-2], ..., se[k+1], se[k], ..., se[0]], where se[N-1], ..., se[0] are the elements of tensor_shape_e, with
se[k+1] = m, se[k] = s[k] / m,
and every other dimension keeps its original size.
If no m[k] = i exists in tensor_map, the tensor map and tensor shape of the equivalent layout are unchanged.
Illustratively:
Example 1. Original layout: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [1, 0]. Based on the minimal normalized expanded device matrix device_matrix_e = [4, 2, 4], the equivalent layout has tensor_map_e = [2, 1, 0] and tensor_shape_e = [4, 128, 1024].
Example 2. Original layout: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [0, 1]. Based on the minimal normalized expanded device matrix device_matrix_e = [4, 2, 4], the equivalent layout has tensor_map_e = [0, 2, 1] and tensor_shape_e = [512, 4, 256].
Example 3. Original layout: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [-1, 0]. Based on the minimal normalized expanded device matrix device_matrix_e = [4, 2, 4], the equivalent layout has tensor_map_e = [-1, 0] and tensor_shape_e = [512, 1024].
Example 4. Original layout: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [-1, 1]. Based on the minimal normalized expanded device matrix device_matrix_e = [4, 2, 4], the equivalent layout has tensor_map_e = [-1, 2, 1] and tensor_shape_e = [512, 4, 256].
Note that the conversion between tensor_layout_from and tensor_layout_from2 can be realized by a reshape operator, and the conversion between tensor_layout_to and tensor_layout_to2 can likewise be realized by a reshape operator.
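A sketch of the device-matrix expansion of a single layout, with the update rules as reconstructed above (an illustration checked against Example 1, not the patent's code):

```python
def expand_device_dim(layout: TensorLayout, i: int, m: int) -> TensorLayout:
    """Split device-matrix dimension i (numbered from the right), of size
    m*n, into two dimensions (m, n), updating tensor map and shape."""
    dm = list(layout.device_matrix)
    pos = len(dm) - 1 - i
    n = dm[pos] // m
    dm[pos:pos + 1] = [m, n]         # new dimensions i+1 (size m) and i (size n)

    tmap, shape = [], []
    for v, s in zip(layout.tensor_map, layout.tensor_shape):
        if v == i:                   # this tensor dim was sliced over dim i:
            tmap += [i + 1, i]       # it splits into an m-part and an n-part
            shape += [m, s // m]
        else:                        # other map values only shift past the split
            tmap.append(v + 1 if v > i else v)
            shape.append(s)
    return TensorLayout(dm, tmap, shape)

# Example 1 above:
lay = TensorLayout([8, 4], [1, 0], [512, 1024])
print(expand_device_dim(lay, i=1, m=4))
# device_matrix=[4, 2, 4], tensor_map=[2, 1, 0], tensor_shape=[4, 128, 1024]
```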
902. The data processing device performs tensor-shape normalization according to tensor_layout_from2 and tensor_layout_to2, determining the third equivalent tensor layout tensor_layout_from3 (equivalent to tensor_layout_from2) and the fourth equivalent tensor layout tensor_layout_to3 (equivalent to tensor_layout_to2).
This step is performed when the device matrices of the first equivalent layout (tensor_layout_from2) and the second equivalent layout (tensor_layout_to2) agree while their tensor shapes differ; if both the device matrices and the tensor shapes agree, proceed directly to step 903. Optionally, if the device matrices of tensor_layout_from and tensor_layout_to agree while the shapes differ, step 901 may be skipped and this step 902 executed directly, taking tensor_layout_from2 equal to tensor_layout_from and tensor_layout_to2 equal to tensor_layout_to.
Suppose s[i] = m·n in tensor_shape = [s[N-1], s[N-2], ..., s[0]] is expanded; the expanded tensor shape is tensor_shape_e = [s[N-1], s[N-2], ..., s[i+1], m, n, s[i-1], ..., s[0]].
Check whether d[m[i]] > m, i.e., whether the size of the device-matrix dimension onto which tensor dimension i is sliced exceeds the first factor m:
If d[m[i]] > m, then (as reconstructed from Example 1 below) the device matrix of the equivalent layout splits device dimension p = m[i], of size d[p], into the pair (m, d[p]/m), the other device dimensions being unchanged:
device_matrix_e = [de[D], de[D-1], ..., de[0]], with de[p+1] = m, de[p] = d[p]/m, and de unchanged elsewhere (values above the split shifted by one position);
and the tensor map of the equivalent layout becomes
tensor_map_e = [me[N], me[N-1], ..., me[i+1], me[i], ..., me[0]], with the split tensor dimension mapped onto the two new device dimensions, me[i+1] = p+1 and me[i] = p, and every other value v > p incremented by one.
If d[m[i]] ≤ m (the original text repeats the condition "d[m[i]] > m" here, evidently a typographical error), the device matrix of the equivalent layout is unchanged;
and in the tensor map of the equivalent layout the first factor keeps the original mapping while the second factor is unsliced: me[i+1] = m[i], me[i] = -1, other values unchanged.
Illustratively:
Example 1. Layout before expansion: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [1, 0]. Based on the expanded tensor shape tensor_shape_e = [512, 2, 512], the equivalent layout has device_matrix_e = [8, 2, 2] and tensor_map_e = [2, 1, 0].
Example 2. Layout before expansion: tensor_shape = [512, 1024], device_matrix = [8, 4], tensor_map = [1, 0]. Based on the expanded tensor shape tensor_shape_e = [128, 4, 1024], the equivalent layout has device_matrix_e = [8, 4] and tensor_map_e = [1, -1, 0].
Note that the conversion between tensor_layout_from2 and tensor_layout_from3 can be realized by a reshape operator, and the conversion between tensor_layout_to2 and tensor_layout_to3 can likewise be realized by a reshape operator.
Before executing step 903, first check whether the tensor maps of tensor_layout_from3 and tensor_layout_to3 agree. If they agree, the tensor slices have the same shape, and step 903 is not executed; if they disagree, the tensor slices have different shapes, and the tensor maps must be converted by inserting tensor-map conversion operators, i.e., step 903 is executed.
903. Convert the tensor map according to the third equivalent layout (tensor_layout_from3) of the first tensor and the fourth equivalent layout (tensor_layout_to3) of the second tensor, determining the tensor-map conversion operators.
Since tensor_layout_from3 and tensor_layout_to3 have identical device arrangements and identical tensor shapes while their tensor maps differ, tensor-map conversion operators must be inserted between the tensors with differing maps, including communication operators, slice operators, or concat operators; communication operators include the alltoall and allgather operators. Optionally, if tensor_layout_from2 and tensor_layout_to2 already have identical device matrices and tensor shapes while their maps differ, step 902 may be skipped and this step 903 executed directly, taking tensor_layout_from3 equal to tensor_layout_from2 and tensor_layout_to3 equal to tensor_layout_to2. Optionally, if tensor_layout_from and tensor_layout_to already have identical device matrices and tensor shapes while their maps differ, steps 901 to 902 may be skipped and this step 903 executed directly, taking tensor_layout_from3 equal to tensor_layout_from and tensor_layout_to3 equal to tensor_layout_to.
Through tensor-map conversion operators, conversion between tensors with different tensor maps can be realized, for example the conversion from tensor_layout_from3 to tensor_layout_to3. Understandably, when communication, slice, and concat operators are generated, the operators' tensor layouts can be obtained as well.
Depending on how the tensor maps disagree, the data processing device will generate one or more of the reshape, slice, alltoall, and allgather operators.
Optionally, one tensor-map conversion flow in this embodiment:
1) First check whether there is a conversion from unsliced to sliced between tensor_map_from3 and tensor_map_to3. An element of the tensor map represents the mapping of the slicing of the corresponding tensor dimension onto the device matrix: a value greater than or equal to 0 names a device-matrix dimension, while a value of -1 means that tensor dimension is not sliced on the device matrix. Therefore, if a corresponding element changes from -1 to a value greater than or equal to 0 between tensor_map_from3 and tensor_map_to3, a slice operator is inserted.
Illustratively: converting tensor_map_from3 = (-1, -1, -1, 3, 2, 1, 0) to tensor_map_to3 = (4, -1, -1, 3, 2, 1, 0), the element of dimension 0 changes from -1 to 4, so a slice operator is inserted to slice dimension 0 of the tensor.
2) If element positions are exchanged between tensor_map_from3 and tensor_map_to3, a communication operator (the alltoall operator) is inserted.
Illustratively, when the tensor map is converted from (4, -1, -1, 3, 2, 1, 0) to (4, 3, -1, -1, 2, 1, 0), the elements of dimension 1 and dimension 3 are exchanged, so an alltoall operator is inserted.
3) Finally, perform the conversions from sliced to unsliced: where an element changes from a value greater than or equal to 0 to -1, a communication operator (the allgather operator) is inserted.
Through this series of tensor-map conversion operators, tensor_layout_from3 can be converted into tensor_layout_to3.
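The three-phase conversion can be sketched as follows; this is a reconstruction of the flow described above (slice first, then alltoall, then allgather), verified against the two worked examples, not the patent's exact algorithm:

```python
def infer_tensor_map_ops(tmap_from, tmap_to):
    ops, cur = [], list(tmap_from)
    # Phase 1: unsliced (-1) -> sliced, device axis not in use elsewhere: slice.
    for d, t in enumerate(tmap_to):
        if cur[d] == -1 and t != -1 and t not in cur:
            ops.append(("slice", d)); cur[d] = t
    # Phase 2: a device axis moving to another tensor dimension: alltoall.
    for d, t in enumerate(tmap_to):
        if t != -1 and cur[d] != t and t in cur:
            src = cur.index(t)
            ops.append(("alltoall", src, d))
            cur[src], cur[d] = cur[d], t      # exchange the two positions
    # Phase 3: sliced -> unsliced (-1): allgather.
    for d, t in enumerate(tmap_to):
        if cur[d] != -1 and t == -1:
            ops.append(("allgather", d)); cur[d] = -1
    assert cur == list(tmap_to)               # a sketch; pathological cases untreated
    return ops

print(infer_tensor_map_ops([-1, -1, -1, 3, 2, 1, 0],
                           [4, 3, -1, -1, 2, 1, 0]))
# [('slice', 0), ('alltoall', 3, 1)]  -- the two examples above, chained
```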
904. Determine the reshape operators.
Determine the reshape operators from the tensor_layout_from3 and tensor_layout_to3 obtained in step 902.
Check whether the slice shapes determined by tensor_layout_from and tensor_layout_from3 agree; if not, a first reshape operator must be inserted, whose input is tensor_layout_from and whose output is tensor_layout_from3.
Check whether the slice shapes determined by tensor_layout_to and tensor_layout_to3 agree; if not, a second reshape operator must be inserted, whose input is tensor_layout_to and whose output is tensor_layout_to3.
Example 1: tensor_shape_from = [1024, 512] and tensor_shape_from3 = [2, 256, 2, 2, 2, 2, 64] have different shapes; from tensor_layout_from3 the slice shape tensor_shape_from_slice3 = [2, 256, 2, 1, 1, 1, 32] is obtained, so a reshape operator is inserted at the start of the redistribution sequence, with tensor_shape_from_slice3 as the shape after the reshape.
Example 2: tensor_shape_to = [512, 1024] and tensor_shape_to3 = [2, 256, 2, 2, 2, 2, 64] have different shapes; from tensor_layout_to the slice shape tensor_shape_from_slice = [128, 256] is obtained, so a reshape operator is inserted at the end of the redistribution sequence, with tensor_shape_from_slice as the shape after the reshape.
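In code form, the reshape decision amounts to comparing per-device slice shapes; a minimal illustrative check (assuming the `TensorLayout` sketch from earlier, with hypothetical operator tuples):

```python
def maybe_reshape(ops, layout_src, layout_dst, at_front=False):
    # Insert a reshape only when the per-device slice shapes differ.
    if layout_src.slice_shape() != layout_dst.slice_shape():
        op = ("reshape", layout_dst.slice_shape())
        if at_front:
            ops.insert(0, op)   # start of the redistribution sequence
        else:
            ops.append(op)      # end of the redistribution sequence
```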
Finally, the redistribution sequence can be determined: the sequence of operators inserted, in order, between the first and second operators. The redistribution sequence includes one or more operators, the specific number not being limited here; it includes all redistribution operators determined in steps 901 to 904, i.e., one or more of the slice, concat, reshape, and communication operators.
The first and second operators are two consecutive operators in the full forward graph of the model, with the layout of the first operator's output tensor (tensor_layout_from) inconsistent with the layout of the second operator's input tensor (tensor_layout_to); through the redistribution sequence, tensor_layout_from can be converted into tensor_layout_to.
The redistribution-operator generation apparatus provided in the embodiments of this application can generate the required redistribution operators for any tensor-layout conversion: based on the tensor-layout expression, it generates redistribution operators for any layout conversion, and the generated operator sequence has the smallest cost.
FIG. 10 is a schematic diagram of another embodiment of generating a redistribution operator in the embodiments of this application.
Suppose the first and second operators are two consecutive operators in the full forward graph of the model, and the layout of the first operator's output tensor (tensor_layout_from) disagrees with the layout of the second operator's input tensor (tensor_layout_to); the model therefore cannot be executed according to those layouts. In this embodiment, redistribution operators are generated from tensor_layout_from and tensor_layout_to, and by inserting them between the first and second operators the model is updated so that the updated model is executable. The redistribution operators include reshape operators, communication operators, slice operators, concat operators, and the like; the operators inserted between the first and second operators may include one or more of these classes, without specific limitation.
The data processing device determines the optimal slicing strategy among multiple candidate strategies and uses the overall tensor layout corresponding to the optimal strategy to generate the sliced computation graph; alternatively, it obtains, for each candidate strategy, the corresponding overall tensor layout, yielding multiple overall layouts, and determines the optimal overall layout among them to generate the sliced computation graph. Refer to FIG. 11, a schematic diagram of an embodiment of determining the overall tensor layout in the embodiments of this application.
1101. Determine multiple overall tensor layouts of the deep neural network model according to the candidate slicing strategies.
From the deep neural network model, the data processing device can determine all tensors in the model, and from the element count of each tensor in each dimension it can determine the tensor shapes.
The data processing device can also obtain the device topology information of the distributed cluster, i.e., the cluster resource distribution used for the model, including the number of servers, the number of data processing devices on each server, and the connections between servers. The devices may be, for example, GPUs, CPUs, or other processor types, the specific type not being limited here. From the topology information, the total number of data processing devices in the cluster can be obtained, which constrains the device matrices in the generated tensor layouts. The device determines the first operator's tensors (input and output tensors) from the full forward graph and, according to the device count, determines the same device matrix for each operator's input and output tensors following preset rules.
From the tensor shape of every tensor and the device topology information, different candidate slicing strategies can be determined by traversal, i.e., the slicing manner of each tensor, including the slice count in each dimension, so as to determine the mapping of the slices of all tensor dimensions of the model onto the device cluster, i.e., to determine each tensor's device matrix and tensor map under that strategy, thereby obtaining multiple overall tensor layouts of the model. An overall tensor layout is the set of tensor layouts of the input and output tensors of all operators of the deep neural network model.
Similarly, all candidate slicing strategies are traversed and the overall tensor layout corresponding to each candidate is determined.
Optionally, different preset rules are determined according to operator type, so as to determine suitable device matrices for the operator tensors; operator types include, for example, the matrix multiplication operator, the tensor addition operator, the convolution operator, and the softmax operator.
Optionally, since in deep-neural-network training the data-parallel (batch) tensor dimension is less sensitive to communication latency and bandwidth than the other dimensions, and considering that the network bandwidth among GPUs inside a server is high and the latency low while inter-server communication latency is high, the batch dimension is preferentially sliced across nodes and the model dimensions within a node. Specifically, suppose the device matrix is [D2, D1, D0]. 1) When constructing the device matrix from the cluster topology information: first arrange the data processing devices inside the same server along the D1/D2 axes, then arrange inter-server or intra-server devices along the D0 axis. 2) When mapping a tensor onto the device matrix: identify the tensor's dimensions, map the batch dimension across nodes, i.e., onto device-matrix dimension D0, and map the non-batch dimensions of the tensor (the model dimensions) onto D1 or D2. A toy constructor following this rule is sketched after the FIG. 12 example below.
Illustratively, refer to FIG. 12, a schematic diagram of another embodiment of determining the overall tensor layout in the embodiments of this application.
FIG. 12 illustrates the parallel modeling of a two-dimensional matrix multiplication operator on 16 devices in total, with 4 cards per machine; the operator's input tensors are the two-dimensional tensors M and N, and its output tensor is the two-dimensional tensor Q.
As shown in the figure, the first server holds the four data processing devices A1, A2, A3, and A4; the second holds B1, B2, B3, and B4; the third holds C1, C2, C3, and C4; and the fourth holds E1, E2, E3, and E4. Each element of the device matrix, i.e., each cube in the figure, corresponds to one data processing device, and all elements of the device matrix are all the data processing devices of the distributed cluster used to execute the model in this application. The device matrix [D2, D1, D0] in the figure is [2, 2, 4].
Tensor M is mapped onto the D0-D1 plane. Suppose the rows of tensor M are the sample dimension; the rows are mapped onto dimension D0, since the sample dimension has lower communication requirements, and correspondingly the inter-server devices are arranged along D0 while the other dimensions hold the devices inside a server. Illustratively, A1, B1, C1, and E1 are data processing devices in different servers and are therefore arranged along the D0 axis of the device matrix, while A1 and A2 are two devices inside the same server and are therefore arranged along the D1 or D2 axis.
Thus the scheme is designed for the differing bandwidths between devices: by expressing the cluster topology through the device matrix, it can adapt flexibly to all kinds of hierarchical combined network topologies and reduce communication latency.
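Under the stated assumption that the batch axis goes across servers, a toy device-matrix constructor for the FIG. 12 cluster might look like the following sketch (the function and its arguments are illustrative, not part of the patent):

```python
import math

def build_device_matrix(num_servers, devices_per_server, model_axes):
    """Topology-aware sketch: the latency-tolerant batch axis becomes D0
    (across servers); the model axes factor the per-server device count."""
    assert math.prod(model_axes) == devices_per_server
    return list(model_axes) + [num_servers]  # [D2, D1, D0], written left to right

print(build_device_matrix(4, 4, (2, 2)))  # [2, 2, 4], as in FIG. 12
```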
1102. Determine the target tensor layout among the multiple overall tensor layouts through the cost model.
The slicing strategy is determined among the multiple candidate strategies; from a candidate strategy, the tensor layout of every operator in the model is obtained, and the redistribution operators between consecutive tensors with differing layouts are generated. Thereby the tensor layouts of all operators of the model can be obtained, abbreviated below as the overall tensor layout. Understandably, multiple overall tensor layouts can be obtained from a slicing strategy.
Based on each overall tensor layout, a cost model is computed. The cost model considers the costs of both the operators and the redistribution operators; specifically, an operator's memory, computation, and communication costs are approximately measured by its tensor shapes, and weighting coefficients control the proportions of computation and communication cost in the loss function, adapting to different device platforms. Among the candidate slicing strategies, the one with the smallest cost model is determined as the target strategy, used to perform graph slicing.
Optionally, the optimal overall tensor layout is determined among the multiple overall layouts and used in the embodiment corresponding to FIG. 6.
Optionally, the optimal slicing strategy is determined from the multiple overall layouts, and an overall layout is generated from the optimal strategy according to preset rules for the embodiment corresponding to FIG. 6.
Optionally, the full single-machine forward computation graph is input and the sliced forward computation graph is output, with redistribution operators inserted into the sliced forward graph; the output sliced forward graph generates the backward computation graph via automatic differentiation.
Optionally, determining the candidate slicing strategy with the smallest loss through the cost model uses the value obtained as the weighted sum of the sizes of the data tensors, communication tensors, and parameter tensors in the overall tensor layout with the weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size; the sizes of the data, communication, and parameter tensors are respectively the storage required for the data tensors, communication tensors, and parameter tensors when executing the model under the candidate slicing strategy.
The cost model defines the loss function: loss = alpha * size of the operator's input data tensors + beta * size of the communication tensors + gamma * size of the operator's parameter tensors,
where alpha is the weighting coefficient of data-tensor size, beta that of communication-tensor size, and gamma that of parameter-tensor size. The specific values of alpha, beta, and gamma are not limited here; optionally, they can be set flexibly according to the type of data processing device.
The operators in the cost model include the redistribution operator sequences. Under different slicing strategies the operator costs differ, and the costs of the redistribution sequences between operators also differ. The cost here includes memory and communication cost; an upper bound on memory cost is set according to the actual device memory, and the operator-slicing-strategy combination with the smallest computation cost is sought.
Optionally, the shapes of the input tensors of the forward and backward operators approximately estimate the memory and computation costs. Adjusting the proportions of alpha, beta, and gamma adapts to different hardware platforms; for example, increasing beta raises the weight of communication cost, so the searched strategy requires less communication.
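A minimal sketch of this cost model in Python (the weights and size estimates are inputs; the function names are illustrative, not the patent's code):

```python
def layout_cost(data_size, comm_size, param_size,
                alpha=1.0, beta=1.0, gamma=1.0):
    # loss = alpha * input-data tensor size + beta * communication tensor size
    #      + gamma * parameter tensor size
    return alpha * data_size + beta * comm_size + gamma * param_size

def pick_overall_layout(candidates, **weights):
    """candidates: iterable of (layout, data_size, comm_size, param_size);
    returns the overall layout with the smallest weighted cost."""
    return min(candidates, key=lambda c: layout_cost(*c[1:], **weights))[0]
```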
The method of determining the overall tensor layout provided in the embodiments of this application can model each operator independently, configuring the layouts of its input and output tensors; between operators, the redistribution-operator generation module generates the redistribution operators needed for layout conversion. The cost model considers both operator and redistribution costs, approximating memory, computation, and communication costs by tensor shapes; weighting coefficients control the proportions of computation and communication cost in the loss function, adapting to different device platforms. Optionally, the scheme can take the full single-machine forward computation graph as input and output the sliced forward computation graph, inserting redistribution, AllReduce, and other operators into it; the output sliced forward graph generates the backward computation graph through automatic differentiation.
Combined with the overall automatic-parallelization flow of the embodiments corresponding to FIG. 6 and FIG. 9, the parallel-mode configuration of this method is flexible: each operator is modeled independently and redistribution operator sequences are inserted between operators, which can satisfy the hybrid-parallelism needs of all kinds of common networks and remedy the limitations of existing technology regarding supported network types. The cost model considers the costs of operators and redistribution jointly and can search out the parallel scheme with the smallest overall cost. Approximating computation cost by tensor shape approximates the actual cost well without extensive testing and remains platform-independent; the weighting coefficients adjust the communication-computation proportion to fit different device platforms. There is no need to generate the backward computation network manually, since automatic differentiation and graph-optimization functions are used. Tensor redistribution supports conversion between arbitrary tensor layouts with low conversion cost; support for arbitrary layout conversion is what allows operators to be modeled independently and flexibly. Besides layout conversion between operators, tensor redistribution can also be used for the distributed implementation of the reshape operator. The topology-aware scheduling involved in this application adjusts the device matrix in the tensor layout so that the batch dimension, which is insensitive to communication latency and bandwidth, is parallelized between servers while model slicing stays inside a server; through device-matrix configuration, the scheme adapts simply and flexibly to different cluster topologies.
Refer to FIG. 13, a schematic diagram of an embodiment of the data processing device in the embodiments of this application.
The data processing device provided in the embodiments of this application is applied to a distributed cluster, and the device includes:
an obtaining unit 1301, configured to obtain the deep neural network model and the tensor layouts of every operator's input and output tensors, a tensor layout including a device matrix, a tensor map, and a tensor shape; the model includes a first operator and a second operator, two consecutive operators in the model, the output tensor of the first operator being the input tensor of the second operator, and a first tensor layout being inconsistent with a second tensor layout, where the first tensor layout is the layout of the first operator's output tensor and the second tensor layout is the layout of the second operator's input tensor;
a determining unit 1302, configured to determine the sliced computation graph according to the tensor layouts of every operator's input and output tensors;
the determining unit 1302 is further configured to determine the redistribution operator between the first and second operators, the redistribution operator being used to convert the first tensor layout into the second tensor layout;
the determining unit 1302 is further configured to insert the redistribution operator into the sliced computation graph to determine the updated sliced computation graph, which indicates the part of the model to be executed.
Optionally, the device matrices of the first and second tensor layouts are inconsistent, and/or their tensor maps are inconsistent;
the determining unit 1302 is specifically configured to:
determine an intermediate tensor layout according to the first and second tensor layouts;
determine, according to the first tensor layout and the intermediate layout, a first reshape operator and the layouts of its input and output tensors, the first reshape operator converting the first tensor layout into the intermediate layout; and/or
determine, according to the second tensor layout and the intermediate layout, a second reshape operator and the layouts of its input and output tensors, the second reshape operator being located between the first reshape operator and the second operator and converting the intermediate layout into the second tensor layout.
Optionally, the device matrices of the first and second tensor layouts are inconsistent;
the determining unit 1302 is specifically configured to:
determine an expanded device matrix according to the two device matrices, where the product of the elements of the expanded device matrix equals the product of the elements of the device matrix of the first operator's output tensor and equals the product of the elements of the second device matrix of the second operator's input tensor, and any element of either device matrix equals one element of the expanded matrix or the product of at least two of its elements;
determine, according to the expanded device matrix, a first equivalent tensor layout equivalent to the first tensor layout and a second equivalent tensor layout equivalent to the second, the device matrices of the two equivalent layouts being identical;
when the tensor shapes of the first and second equivalent layouts agree, the intermediate layout includes the first and second equivalent layouts, the tensor shape being the number of elements in each dimension of the tensor.
Optionally, the tensor shapes of the first and second equivalent layouts agree, and their tensor maps differ;
the determining unit 1302 is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the first equivalent layout and whose output-tensor layout matches the second equivalent layout, the one or more conversion operators being used to determine the updated sliced computation graph.
Optionally, when the tensor shapes of the first and second equivalent layouts differ, the determining unit 1302 is further configured to perform tensor-shape normalization according to the first and second equivalent layouts, determining a third equivalent tensor layout equivalent to the first equivalent layout and a fourth equivalent tensor layout equivalent to the second, the device matrices of the third and fourth equivalent layouts being identical and their tensor shapes identical; the intermediate layout includes the third and fourth equivalent layouts.
Optionally, the tensor maps of the third and fourth equivalent layouts differ;
the determining unit 1302 is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the third equivalent layout and whose output-tensor layout matches the fourth, the one or more conversion operators being used to determine the updated sliced computation graph.
Optionally, the device matrices of the first and second tensor layouts agree, and their tensor shapes differ;
the determining unit 1302 is specifically configured to determine, according to the first and second tensor layouts, a fifth equivalent tensor layout equivalent to the first tensor layout and a sixth equivalent tensor layout equivalent to the second, the device matrices of the fifth and sixth equivalent layouts being identical and their tensor shapes identical; the intermediate layout includes the fifth and sixth equivalent layouts.
Optionally, the tensor maps of the fifth and sixth equivalent layouts differ;
the determining unit 1302 is further configured to determine one or more tensor-map conversion operators (slice, concat, or communication operators) whose input-tensor layout matches the fifth equivalent layout and whose output-tensor layout matches the sixth, the one or more conversion operators being used to determine the updated sliced computation graph.
Optionally, the device matrices of the first and second tensor layouts agree, their tensor shapes agree, and their tensor maps differ;
the determining unit 1302 is specifically configured to:
determine one or more tensor-map conversion operators (slice, concat, or communication operators) which take the first operator's output tensor as input and output the second operator's input tensor.
Optionally, the obtaining unit 1301 is specifically configured to:
obtain the deep neural network model and a slicing strategy, the slicing strategy including the slice count of the model's tensors in each dimension;
the determining module is specifically configured to determine, according to the model and the slicing strategy, the layouts of every operator's input and output tensors.
Optionally, the slicing strategy includes a first slicing strategy and a second slicing strategy;
the determining unit 1302 is specifically configured to:
determine the first overall tensor layout corresponding to the first strategy and the second overall tensor layout corresponding to the second, each overall layout being the layouts of every operator's input and output tensors determined under that strategy;
and determine, between the first and second overall layouts, the first overall layout as the layouts of every operator's input and output tensors, the sum of communication and computation time needed to train the model under the first overall layout being smaller than the sum needed under the second.
Optionally, the cost model of the first overall layout is smaller than that of the second, the cost model of the first overall layout being the weighted sum of the sizes of its data tensors, communication tensors, and parameter tensors with the respective weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size;
the cost model of the second overall layout is the corresponding weighted sum for the second overall layout.
Optionally, the slicing strategy is a user-specified slicing strategy.
Optionally, the input tensors of every operator in the model include a training data set, which includes a text data set, an image data set, or an audio data set.
Refer to FIG. 14, a schematic diagram of another embodiment of the data processing device in the embodiments of this application.
The data processing device provided in this embodiment may be a processor, a server, a dedicated data processing device, or the like; the specific device form is not limited in the embodiments of this application.
The data processing device 1400 may vary considerably with configuration or performance, and may include one or more processors 1401 and a memory 1402, the memory 1402 storing programs or data.
The memory 1402 may be volatile or non-volatile storage. Optionally, the processor 1401 is one or more central processing units (CPUs), graphics processing units (GPUs), or other dedicated processors such as Ascend processors; the CPU may be single-core or multi-core. The processor 1401 may communicate with the memory 1402 and execute a series of instructions from the memory 1402 on the data processing device 1400.
The data processing device 1400 further includes one or more wired or wireless network interfaces 1403, such as an Ethernet interface.
Optionally, although not shown in FIG. 14, the device 1400 may also include one or more power supplies and one or more input/output interfaces, which can be used to connect a display, a mouse, a keyboard, a touch-screen device, a sensing device, and so on; the input/output interfaces are optional components that may or may not be present, which is not limited here.
For the procedure executed by the processor 1401 of the data processing device 1400 in this embodiment, refer to the method flows described in the preceding method embodiments; details are not repeated here.
Refer to FIG. 15, a diagram of a chip hardware structure provided by an embodiment of this application.
The deep neural network algorithms involved in the embodiments of this application can be executed in the NPU chip shown in FIG. 15.
The neural-network processing unit (NPU) 50 is mounted as a coprocessor on the host CPU, and the host CPU assigns tasks. The core part of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to fetch matrix data from memory and perform multiplication operations.
In some implementations, the arithmetic circuit 503 internally contains multiple processing engines (PEs). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the arithmetic circuit; it then fetches the data of matrix A from the input memory 501 and performs the matrix operation with matrix B, storing partial or final results of the resulting matrix in the accumulator 508.
The unified memory 506 stores input and output data. Weight data is transferred directly to the weight memory 502 through the direct memory access controller (DMAC) 505; input data is likewise transferred to the unified memory 506 through the DMAC.
The BIU, i.e., the bus interface unit 510, is used for the interaction between the AXI bus, the DMAC, and the instruction fetch buffer 509.
The bus interface unit 510 (BIU) is used by the instruction fetch buffer 509 to obtain instructions from external memory, and by the storage-unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from external memory.
The DMAC is mainly used to transfer input data from the external memory DDR to the unified memory 506, weight data to the weight memory 502, or input data to the input memory 501.
The vector calculation unit 507 may include multiple arithmetic processing units and, where needed, further processes the output of the arithmetic circuit, such as vector multiplication, vector addition, exponentiation, logarithm, size comparison, and so on. It mainly serves the non-convolution/FC layer computation of the neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 507 stores the processed output vector in the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 509 connected to the controller 504 stores the instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is private to this NPU hardware architecture.
The operations of each layer of the deep neural network, i.e., the operators in the embodiments of this application, may be executed by the matrix calculation unit or the vector calculation unit 507.
The foregoing method embodiments of this application may be applied in a processor, or the steps of the foregoing method embodiments may be implemented by a processor. The processor may be an integrated circuit chip with signal-processing capability. During implementation, the steps of the method embodiments may be completed by integrated hardware logic circuits in the processor or by instructions in the form of software. The processor may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; it can implement or execute the methods, steps, and logical block diagrams disclosed in this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in this application can be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. A software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware. Although only one processor is shown in the figure, the apparatus may include multiple processors, or a processor may include multiple processing units. Specifically, the processor may be a single-core (single-CPU) or a multi-core (multi-CPU) processor.
The memory stores the computer instructions executed by the processor. The memory may be a storage circuit or a memory. The memory may be volatile or non-volatile, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory; the volatile memory may be random access memory (RAM), used as an external cache. The memory may be independent of the processor or be a storage unit inside the processor, which is not limited here. Although only one memory is shown in the figure, the apparatus may also include multiple memories, or the memory may include multiple storage units.
The transceiver implements content interaction between the processor and other units or network elements. Specifically, the transceiver may be the communication interface of the apparatus, a transceiver circuit or communication unit, or a transceiver device; it may also be the communication interface or transceiver circuit of the processor. In one possible implementation, the transceiver may be a transceiver chip, which may also include a sending unit and/or a receiving unit. In one possible implementation, the transceiver may include at least one communication interface; in another possible implementation, the transceiver may also be a unit implemented in software. In the embodiments of this application, the processor can interact with other units or network elements through the transceiver; for example, the processor obtains or receives content from other network elements through the transceiver. If the processor and the transceiver are two physically separate components, the processor can interact with other units of the apparatus without going through the transceiver.
In one possible implementation, the processor, the memory, and the transceiver may be connected to each other through a bus. The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, instance, or illustration. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as preferable to or more advantageous than other embodiments or designs. Rather, words such as "exemplary" or "for example" are intended to present a related concept in a concrete manner.
In the embodiments of this application, various examples are given for ease of understanding. These examples are merely examples and do not represent the best way of implementing this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof; when software is used, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer-executable instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

Claims (32)

  1. A data processing method, wherein the method comprises:
    obtaining a deep neural network model, and a tensor layout of input tensors of each operator in the deep neural network model and a tensor layout of output tensors of each operator, the tensor layout comprising a device matrix, a tensor map, and a tensor shape, wherein the deep neural network model comprises a first operator and a second operator, the first operator and the second operator are two consecutive operators in the deep neural network model, an output tensor of the first operator is an input tensor of the second operator, and a first tensor layout is inconsistent with a second tensor layout, the first tensor layout being the tensor layout of the output tensor of the first operator and the second tensor layout being the tensor layout of the input tensor of the second operator;
    determining a sliced computation graph of a data processing device according to the tensor layouts of the input tensors and output tensors of each operator;
    determining a redistribution operator between the first operator and the second operator, the redistribution operator being used to convert the first tensor layout into the second tensor layout;
    inserting the redistribution operator into the sliced computation graph to determine an updated sliced computation graph, the updated sliced computation graph being used to indicate the part of the deep neural network model to be executed.
  2. The method according to claim 1, wherein
    the device matrix of the first tensor layout is inconsistent with the device matrix of the second tensor layout, and/or the tensor maps are inconsistent;
    and determining the redistribution operator between the first operator and the second operator comprises:
    determining an intermediate tensor layout according to the first tensor layout and the second tensor layout;
    determining, according to the first tensor layout and the intermediate tensor layout, a first reshape operator and the tensor layouts of its input and output tensors, the first reshape operator being used to convert the first tensor layout into the intermediate tensor layout; and/or,
    determining, according to the second tensor layout and the intermediate tensor layout, a second reshape operator and the tensor layouts of its input and output tensors, the second reshape operator being located between the first reshape operator and the second operator and being used to convert the intermediate tensor layout into the second tensor layout.
  3. The method according to claim 2, wherein
    the device matrix of the first tensor layout is inconsistent with the device matrix of the second tensor layout;
    and determining the intermediate tensor layout according to the first and second tensor layouts comprises:
    determining an expanded device matrix according to the two device matrices, wherein the product of the elements of the expanded device matrix equals the product of the elements of the device matrix of the first operator's output tensor and equals the product of the elements of the second device matrix of the second operator's input tensor, and any element of either device matrix equals one element of the expanded device matrix or the product of at least two of its elements;
    determining, according to the expanded device matrix, a first equivalent tensor layout equivalent to the first tensor layout and a second equivalent tensor layout equivalent to the second tensor layout, the device matrices of the first and second equivalent tensor layouts being identical;
    when the tensor shapes of the first and second equivalent tensor layouts are identical, the intermediate tensor layout comprising the first equivalent tensor layout and the second equivalent tensor layout, the tensor shape being the number of elements in each dimension of a tensor.
  4. The method according to claim 3, wherein the tensor shapes of the first and second equivalent tensor layouts are identical and their tensor maps are inconsistent;
    the method further comprises:
    determining one or more tensor-map conversion operators, a tensor-map conversion operator comprising a slice operator, a concat operator, or a communication operator, wherein the tensor layout of the input tensor of the one or more conversion operators is identical to the first equivalent tensor layout, the tensor layout of the output tensor is identical to the second equivalent tensor layout, and the one or more conversion operators are used to determine the updated sliced computation graph.
  5. The method according to claim 3, wherein, when the tensor shapes of the first and second equivalent tensor layouts are inconsistent, the method further comprises:
    performing tensor-shape normalization according to the first and second equivalent tensor layouts, determining a third equivalent tensor layout equivalent to the first equivalent tensor layout and a fourth equivalent tensor layout equivalent to the second equivalent tensor layout, the device matrices of the third and fourth equivalent layouts being identical and their tensor shapes identical; the intermediate tensor layout comprising the third and fourth equivalent tensor layouts.
  6. The method according to claim 5, wherein
    the tensor maps of the third and fourth equivalent tensor layouts are inconsistent;
    the method further comprises:
    determining one or more tensor-map conversion operators, a tensor-map conversion operator comprising a slice operator, a concat operator, or a communication operator, wherein the tensor layout of the input tensor of the one or more conversion operators is identical to the third equivalent tensor layout, the tensor layout of the output tensor is identical to the fourth equivalent tensor layout, and the one or more conversion operators are used to determine the updated sliced computation graph.
  7. The method according to claim 2, wherein
    the device matrices of the first and second tensor layouts are identical and their tensor shapes are inconsistent;
    and determining the intermediate tensor layout according to the first and second tensor layouts comprises:
    determining, according to the first and second tensor layouts, a fifth equivalent tensor layout equivalent to the first tensor layout and a sixth equivalent tensor layout equivalent to the second tensor layout, the device matrices of the fifth and sixth equivalent layouts being identical and their tensor shapes identical; the intermediate tensor layout comprising the fifth and sixth equivalent tensor layouts.
  8. The method according to claim 7, wherein
    the tensor maps of the fifth and sixth equivalent tensor layouts are inconsistent;
    the method further comprises:
    determining one or more tensor-map conversion operators, a tensor-map conversion operator comprising a slice operator, a concat operator, or a communication operator, wherein the tensor layout of the input tensor of the one or more conversion operators is identical to the fifth equivalent tensor layout, the tensor layout of the output tensor is identical to the sixth equivalent tensor layout, and the one or more conversion operators are used to determine the updated sliced computation graph.
  9. The method according to claim 1, wherein the device matrices of the first and second tensor layouts are identical, their tensor shapes are identical, and their tensor maps are inconsistent;
    determining the redistribution operator between the first operator and the second operator comprises:
    determining one or more tensor-map conversion operators, a tensor-map conversion operator comprising a slice operator, a concat operator, or a communication operator, the one or more conversion operators being used to take the first operator's output tensor as input and output the second operator's input tensor.
  10. The method according to any one of claims 1 to 9, wherein
    obtaining the deep neural network model and the tensor layouts of every operator's input and output tensors comprises:
    obtaining the deep neural network model and a slicing strategy, the slicing strategy comprising the number of slices of the model's tensors in each dimension;
    determining, according to the deep neural network model and the slicing strategy, the tensor layouts of every operator's input tensors and output tensors in the model.
  11. The method according to claim 10, wherein
    the slicing strategy comprises a first slicing strategy and a second slicing strategy;
    and determining, according to the model and the slicing strategy, the tensor layouts of every operator's input and output tensors comprises:
    determining a first overall tensor layout corresponding to the first slicing strategy and a second overall tensor layout corresponding to the second slicing strategy, the first overall tensor layout being the tensor layouts of every operator's input and output tensors determined based on the first slicing strategy, and the second overall tensor layout being those determined based on the second slicing strategy;
    the method further comprises:
    determining, between the first and second overall tensor layouts, the first overall tensor layout as the tensor layouts of every operator's input and output tensors in the model, the sum of the communication time and computation time needed to perform training of the model under the first overall tensor layout being smaller than the sum of the communication time and computation time needed to perform training of the model based on the second overall tensor layout.
  12. The method according to claim 11, wherein
    the cost model of the first overall tensor layout is smaller than that of the second, the cost model of the first overall layout being the value obtained as the weighted sum of the sizes of the data tensors, communication tensors, and parameter tensors in the first overall layout with the weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size;
    the cost model of the second overall tensor layout being the value obtained as the weighted sum of the sizes of the data tensors, communication tensors, and parameter tensors in the second overall layout with the weighting coefficients of data-tensor size, communication-tensor size, and parameter-tensor size.
  13. The method according to claim 10, wherein
    the slicing strategy is a user-specified slicing strategy.
  14. The method according to any one of claims 1 to 13, wherein the input tensors of every operator in the deep neural network model comprise a training data set, the training data set comprising a text data set, an image data set, or an audio data set.
  15. 一种数据处理设备,其特征在于,所述设备包括:
    获取单元,用于获取深度神经网络模型,以及所述深度神经网络模型中每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布,所述张量排布包括设备矩阵、张量映射和张量形状,所述深度神经网络模型中包括第一算子和第二算子,所述第一算子和所述第二算子为所述深度神经网络模型中的两个连续算子,且所述第一算子的输出张量为所述第二算子的输入张量,第一张量排布与第二张量排布不一致,其中,所述第一张量排布为所述第一算子的输出张量的张量排布,所述第二张量排布为所述第二算子的输入张量的张量排布;
    确定单元,用于根据所述每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布,确定切片计算图;
    所述确定单元,还用于确定所述第一算子和所述第二算子之间的重排布算子,所述重排布算子用于将所述第一张量排布转换为所述第二张量排布;
    所述确定单元,还用于在所述切片计算图中插入所述重排布算子,以确定更新后的切片计算图,所述更新后的切片计算图用于指示执行所述深度神经网络模型的部分。
  16. 根据权利要求15所述的设备,其特征在于,
    所述第一张量排布的设备矩阵与所述第二张量排布的设备矩阵不一致,和/或,所述第一张量排布的张量映射不一致;
    所述确定单元,具体用于:
    根据所述第一张量排布和所述第二张量排布,确定中间张量排布;
    根据所述第一张量排布和所述中间张量排布确定第一重塑算子,以及所述第一重塑算子的输入张量的张量排布和所述第一重塑算子的输出张量的张量排布,所述第一重塑算子用于实现所述第一张量排布至所述中间张量排布的转换;和/或,
    根据所述第二张量排布和所述中间张量排布确定第二重塑算子,以及所述第二重塑算子的输入张量的张量排布和所述第二重塑算子的输出张量的张量排布,所述第二重塑算子位于所述第一重塑算子和所述第二算子之间,所述第二重塑算子用于实现所述中间张量排布到所述第二张量排布的转换。
  17. 根据权利要求16所述的设备,其特征在于,
    所述第一张量排布的设备矩阵与所述第二张量排布的设备矩阵不一致;
    所述确定单元,具体用于:
    根据所述第一张量排布的设备矩阵与所述第二张量排布的设备矩阵,确定拓展设备矩阵,所述拓展设备矩阵的元素之积与所述第一算子的输出张量的设备矩阵的元素之积相同,与所述第二算子的输入张量的第二设备矩阵的元素之积相同,所述第一张量排布的设备矩阵和所述第二张量排布的设备矩阵中任一元素,等于所述拓展设备矩阵中的一个元素或者所述拓展 设备矩阵的至少两个元素的乘积;
    根据所述拓展设备矩阵,确定与所述第一张量排布等价的第一等价张量排布,以及与所述第二张量排布等价的第二等价张量排布,所述第一等价张量排布的设备矩阵与所述第二等价张量排布的设备矩阵一致;
    当所述第一等价张量排布的张量形状与所述第二等价张量排布的张量形状一致时,所述中间张量排布包括所述第一等价张量排布和所述第二等价张量排布,所述张量形状为张量每个维度的元素数量。
  18. 根据权利要求17所述的设备,其特征在于,所述第一等价张量排布的张量形状与所述第二等价张量排布的张量形状一致,且所述第一等价张量排布的张量映射与所述第二等价张量排布的张量映射不一致;
    所述确定单元,还用于确定一个或多个张量映射转换算子,所述张量映射转换算子包括切分算子、合并算子或通信算子,所述一个或多个张量映射转换算子输入张量的张量排布与所述第一等价张量排布一致,所述一个或多个张量映射转换算子输出张量的张量排布与所述第二等价张量排布一致,所述一个或多个张量映射转换算子用于确定所述更新后的切片计算图。
  19. 根据权利要求17所述的设备,其特征在于,
    当所述第一等价张量排布的张量形状与所述第二等价张量排布的张量形状不一致时,所述确定单元还用于根据所述第一等价张量排布和所述第二等价张量排布进行张量形状归一化,确定与所述第一等价张量排布等价的第三等价张量排布,以及与所述第二等价张量排布等价的第四等价张量排布,所述第三等价张量排布的设备矩阵与所述第四等价张量排布的设备矩阵一致,所述第三等价张量排布的张量形状与所述第四等价张量排布的张量形状一致;所述中间张量排布包括所述第三等价张量排布和所述第四等价张量排布。
  20. 根据权利要求19所述的设备,其特征在于,
    所述第三等价张量排布的张量映射与所述第四等价张量排布的张量映射不一致;
    所述确定单元还用于:确定一个或多个张量映射转换算子,所述张量映射转换算子包括切分算子、合并算子或通信算子,所述一个或多个张量映射转换算子输入张量的张量排布与所述第三等价张量排布一致,所述一个或多个张量映射转换算子输出张量的张量排布与所述第四等价张量排布一致,所述一个或多个张量映射转换算子用于确定所述更新后的切片计算图。
  21. 根据权利要求16所述的设备,其特征在于,所述第一张量排布的设备矩阵与所述第二张量排布的设备矩阵一致,且所述第一张量排布的张量形状与所述第二张量排布的张量形状不一致;
    所述确定单元,具体用于:根据所述第一张量排布和所述第二张量排布确定与所述第一张量排布等价的第五等价张量排布,以及与所述第二等价张量排布等价的第六等价张量排布,所述第五等价张量排布的设备矩阵与所述第六等价张量排布的设备矩阵一致,所述第五等价张量排布的张量形状与所述第六等价张量排布的张量形状一致;所述中间张量排布包括所述第五等价张量排布和所述第六等价张量排布。
  22. 根据权利要求21所述的设备,其特征在于,所述第五等价张量排布的张量映射与所 述第六等价张量排布的张量映射不一致;
    所述确定单元,还用于确定一个或多个张量映射转换算子,所述张量映射转换算子包括切分算子、合并算子或通信算子,所述一个或多个张量映射转换算子输入张量的张量排布与所述第五等价张量排布一致,所述一个或多个张量映射转换算子输出张量的张量排布与所述第六等价张量排布一致,所述一个或多个张量映射转换算子用于确定所述更新后的切片计算图。
  23. 根据权利要求15所述的设备,其特征在于,所述第一张量排布的设备矩阵和所述第二张量排布的设备矩阵一致,所述第一张量排布的张量形状与所述第二张量排布的张量形状一致,且所述第一张量排布的张量映射与所述第二张量排布的张量映射不一致;
    所述确定单元具体用于:
    确定一个或多个张量映射转换算子,所述张量映射转换算子包括切分算子、合并算子或通信算子,所述一个或多个张量映射转换算子用于输入所述第一算子的输出张量,并输出所述第二算子的输入张量。
  24. 根据权利要求15至23中任一项所述的设备,其特征在于,所述获取单元,具体用于:
    获取深度神经网络模型和切分策略,所述切分策略包括所述深度神经网络模型的张量在每个维度上的切分数量;
    所述确定模块具体用于根据所述深度神经网络模型和所述切分策略确定所述深度神经网络模型中每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布。
  25. 根据权利要求24所述的设备,其特征在于,
    所述切分策略包括第一切分策略和第二切分策略;
    所述确定单元具体用于:
    确定所述第一切分策略对应的第一整体张量排布,以及所述第二切分策略对应的第二整体张量排布,所述第一整体张量排布为基于所述第一切分策略确定的所述深度神经网络模型中每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布,所述第二整体张量排布为基于所述第二切分策略确定的所述深度神经网络模型中每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布;
    从所述第一整体张量排布和所述第二整体张量排布中确定所述第一整体张量排布为所述深度神经网络模型中每个算子的输入张量的张量排布和所述每个算子的输出张量的张量排布,所述第一整体张量排布执行所述深度神经网络模型的训练所需的通信时间与计算时间之和,小于基于所述第二整体张量排布执行所述深度神经网络模型的训练所需的通信时间与计算时间之和。
  26. The device according to claim 25, wherein
    the cost model of the first overall tensor arrangement is smaller than the cost model of the second overall tensor arrangement; the cost model of the first overall tensor arrangement is a value obtained by a weighted summation of the size of the data tensor, the size of the communication tensor, and the size of the parameter tensor in the first overall tensor arrangement, using the weight coefficient of the size of the data tensor, the weight coefficient of the size of the communication tensor, and the weight coefficient of the size of the parameter tensor; and
    the cost model of the second overall tensor arrangement is a value obtained by a weighted summation of the size of the data tensor, the size of the communication tensor, and the size of the parameter tensor in the second overall tensor arrangement, using the weight coefficient of the size of the data tensor, the weight coefficient of the size of the communication tensor, and the weight coefficient of the size of the parameter tensor.
  27. The device according to claim 24, wherein the slicing strategy is a slicing strategy specified by a user.
  28. The device according to any one of claims 15 to 27, wherein an input tensor of each operator in the deep neural network model comprises a training dataset, and the training dataset comprises a text dataset, an image dataset, or an audio dataset.
  29. A data processing device, comprising a processor and a memory that are connected to each other, wherein the memory is configured to store a computer program, the computer program comprises program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1 to 14.
  30. A computer program product comprising instructions, wherein when the computer program product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 14.
  31. A computer-readable storage medium comprising instructions, wherein when the instructions are run on a computer, the computer is caused to perform the method according to any one of claims 1 to 14.
  32. A distributed cluster, wherein the distributed cluster comprises one or more data processing devices according to any one of claims 15 to 28.
PCT/CN2021/074108 2020-03-27 2021-01-28 Data processing method and data processing device WO2021190127A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21776526.2A EP4123515A4 (en) 2020-03-27 2021-01-28 Data processing method and data processing device
US17/952,848 US20230023101A1 (en) 2020-03-27 2022-09-26 Data processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010231450.7 2020-03-27
CN202010231450.7A CN113449857B (zh) 2020-03-27 Data processing method and data processing device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/952,848 Continuation US20230023101A1 (en) 2020-03-27 2022-09-26 Data processing method and device

Publications (1)

Publication Number Publication Date
WO2021190127A1 (zh)

Family

ID=77808058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074108 WO2021190127A1 (zh) 2020-03-27 2021-01-28 Data processing method and data processing device

Country Status (4)

Country Link
US (1) US20230023101A1 (zh)
EP (1) EP4123515A4 (zh)
CN (4) CN115456161A (zh)
WO (1) WO2021190127A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115552477A (zh) * 2020-05-01 2022-12-30 Magic Leap, Inc. Image descriptor network with imposed hierarchical normalization
CN113961351B (zh) * 2021-10-28 2022-12-30 Beijing Baidu Netcom Science and Technology Co., Ltd. Distributed training method, apparatus, device, and storage medium for deep learning models
CN114186633B (zh) * 2021-12-10 2023-04-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Distributed model training method, apparatus, device, and storage medium
CN116501325A (zh) * 2022-01-17 2023-07-28 Huawei Technologies Co., Ltd. Operator processing method and computer device
CN116888601A (zh) * 2022-01-28 2023-10-13 Huawei Technologies Co., Ltd. Method and apparatus for processing computing tasks
KR20230118486A (ko) * 2022-02-04 2023-08-11 MakinaRocks Co., Ltd. Method for evaluating the placement of semiconductor devices
CN114429211A (zh) * 2022-02-07 2022-05-03 Beijing Baidu Netcom Science and Technology Co., Ltd. Method, apparatus, device, medium, and product for generating information
US20230259758A1 (en) * 2022-02-16 2023-08-17 Moffett International Co., Limited Adaptive tensor compute kernel for sparse neural network
CN115186821B (zh) * 2022-09-13 2023-01-06 Zhejiang Lab Chiplet-oriented neural network inference overhead estimation method and apparatus, and electronic device
CN117827419A (zh) * 2022-09-29 2024-04-05 Huawei Technologies Co., Ltd. Multi-die-based computing method and related device
CN115660049B (zh) * 2022-11-02 2023-07-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Model processing method and apparatus, electronic device, and storage medium
CN116227585B (zh) * 2023-05-10 2023-07-25 Zhejiang Lab Parallel execution method and apparatus for cluster tasks, computer device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190205736A1 (en) * 2017-12-29 2019-07-04 Intel Corporation Compute optimization mechanism for deep neural networks
US10796225B2 (en) * 2018-08-03 2020-10-06 Google Llc Distributing tensor computations across computing devices
CN110689115B (zh) * 2019-09-24 2023-03-31 Anhui Cambricon Information Technology Co., Ltd. Neural network model processing method and apparatus, computer device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293493A1 (en) * 2017-04-10 2018-10-11 Intel Corporation Abstraction layers for scalable distributed machine learning
CN110321999A (zh) * 2018-03-30 2019-10-11 Beijing DeePhi Intelligent Technology Co., Ltd. Neural network computational graph optimization method
CN109491784A (zh) * 2018-10-18 2019-03-19 Beijing Megvii Technology Co., Ltd. Method and apparatus for reducing memory usage, electronic device, and readable storage medium
CN110490309A (zh) * 2019-08-14 2019-11-22 Beijing Zhongke Cambricon Technology Co., Ltd. Operator fusion method for neural networks and related products
CN110689121A (zh) * 2019-09-24 2020-01-14 Shanghai Cambricon Information Technology Co., Ltd. Method for splitting a neural network model using a multi-core processor and related products

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4123515A4

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091589A (zh) * 2021-11-11 2022-02-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Model training method and apparatus, electronic device, and medium
CN114091589B (zh) * 2021-11-11 2022-08-09 Beijing Baidu Netcom Science and Technology Co., Ltd. Model training method and apparatus, electronic device, and medium
CN114186687A (zh) * 2022-02-17 2022-03-15 Zhejiang Lab Intermediate representation method and apparatus for neural network model computation
US11823053B2 (en) 2022-02-17 2023-11-21 Zhejiang Lab Method of neural network model computation-oriented intermediate representation by constructing physical computation graph, inferring information of input and output tensor edges of each node therein, performing memory optimization on tensor edges, and optimizing physical computation graph
CN114598631A (zh) * 2022-04-28 2022-06-07 Zhejiang Lab Modeling method and apparatus for distributed data routing oriented to neural network computing
US11805025B1 (en) 2022-04-28 2023-10-31 Zhejiang Lab Neural network computing-oriented modeling method and apparatus for distributed data routing
WO2023241312A1 (zh) * 2022-06-16 2023-12-21 Beijing Volcano Engine Technology Co., Ltd. Model training method and apparatus
CN117827523A (zh) * 2024-03-05 2024-04-05 Beijing Biren Technology Development Co., Ltd. Model exception handling method and apparatus, electronic device, and storage medium
CN117827523B (zh) * 2024-03-05 2024-05-14 Beijing Biren Technology Development Co., Ltd. Model exception handling method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN115456159A (zh) 2022-12-09
CN115456161A (zh) 2022-12-09
EP4123515A4 (en) 2023-06-28
CN113449857B (zh) 2022-08-19
EP4123515A1 (en) 2023-01-25
US20230023101A1 (en) 2023-01-26
CN113449857A (zh) 2021-09-28
CN115456160A (zh) 2022-12-09

Similar Documents

Publication Publication Date Title
WO2021190127A1 (zh) Data processing method and data processing device
CN111401406B (zh) Neural network training method, video frame processing method, and related devices
WO2022068623A1 (zh) Model training method and related devices
US10762425B2 (en) Learning affinity via a spatial propagation neural network
WO2021233342A1 (zh) Neural network construction method and system
CN112084038B (zh) Memory allocation method and apparatus for neural networks
WO2022111617A1 (zh) Model training method and apparatus
WO2023093724A1 (zh) Processing method and apparatus for neural network models
CN113065997B (zh) Image processing method, neural network training method, and related devices
CN113449859A (zh) Data processing method and apparatus
CN112163601A (zh) Image classification method and system, computer device, and storage medium
WO2022179586A1 (zh) Model training method and associated devices
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
WO2022156475A1 (zh) Neural network model training method, data processing method, and apparatus
WO2020062299A1 (zh) Neural network processor, data processing method, and related devices
WO2021120177A1 (zh) Method and apparatus for compiling neural network models
CN111652349A (zh) Neural network processing method and related devices
WO2023273934A1 (zh) Model hyperparameter selection method and related apparatus
WO2023164933A1 (zh) Building modeling method and related apparatus
WO2023071658A1 (zh) AI model processing method, operation method, and apparatus
US20230153604A1 (en) Performing simulations using machine learning
WO2023122896A1 (zh) Data processing method and apparatus
CN111860824A (zh) Data processing method and related products
WO2022227024A1 (zh) Operation method, training method, and apparatus for neural network models
WO2021120036A1 (zh) Data processing apparatus and data processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21776526
    Country of ref document: EP
    Kind code of ref document: A1
ENP Entry into the national phase
    Ref document number: 2021776526
    Country of ref document: EP
    Effective date: 20221019
NENP Non-entry into the national phase
    Ref country code: DE