WO2022178660A1 - Data processing method, apparatus, device and medium - Google Patents

Data processing method, apparatus, device and medium

Info

Publication number
WO2022178660A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
target
tensors
strategy
sub
Prior art date
Application number
PCT/CN2021/077413
Other languages
English (en)
French (fr)
Inventor
范礼
韩树发
皮华立
王洁欣
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 华为技术有限公司
Priority to EP21927108.7A (EP4280107A4)
Priority to PCT/CN2021/077413
Priority to CN202180092652.0A (CN116868202A)
Publication of WO2022178660A1
Priority to US18/453,681 (US20230394110A1)

Classifications

    • G06F 16/24569: Query processing with adaptation to specific hardware, e.g. adapted for using GPUs or SSDs
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 16/9024: Graphs; Linked lists
    • G06N 3/042: Knowledge-based neural networks; Logical representations of neural networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/092: Reinforcement learning
    • G06N 3/126: Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G06N 3/0442: Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a data processing method, apparatus, device and medium.
  • The neural network compiler treats the neural network (NN) model as a computation graph. It first performs graph compilation: it analyzes the graph topology, converts the computation nodes into tasks on different computation engines, determines the computation order, and forms the actual computation stream. It then performs operator compilation to generate computing task code that can run on the accelerator SoC.
  • Among these optimizations, the most important is how to use the on-chip cache (Cache & Buffer) efficiently to save data access overhead, reduce the limitation of external storage access bandwidth, and improve the efficiency of the data loading and data computation pipelines.
  • Figure 1 is the architecture diagram of the cache running on the SoC.
  • In the Buffer scheme, the memory DDR 101 sends data to the processor engine 103 for computation, and engine 103 caches the intermediate data in buffer 102 during the computation, thereby implementing intermediate data caching.
  • In the Cache scheme, cache 105 is part of engine 106; engine 106 interacts with DDR 104, and engine 106 decides how intermediate data is cached in cache 105.
  • However, the amount of data processed by the operators in a neural network is large: the data throughput of a single task often reaches tens of MB or even GB, while the SoC Cache and Buffer capacities are limited. This often causes cache misses or buffer overflow, so the data cannot be cached.
  • Embodiments of the present application provide a data processing method, apparatus, device, and medium, which are used to reduce the amount of intermediate data in neural network model processing and reduce the frequency of data exchange between the cache and external storage, thereby avoiding cache overflow.
  • An embodiment of the present application provides a data processing method, including: dividing a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed; determining a target computation order of the at least two first sub-tensors, the target computation order being the sequential processing order of the at least two first sub-tensors; and processing the at least two first sub-tensors according to the target computation order.
  • the cache data reuse rate can be improved, the calculation data throughput can be increased, and the data exchange frequency between the cache and external storage can be reduced.
  • Optionally, the dividing the first tensor into at least two first sub-tensors includes: inputting the first tensor into a left matrix; dividing the first tensor into N parts along one axis of the left matrix, where N is a positive integer greater than or equal to 2; inputting the first tensor into a right matrix; and dividing the first tensor into M parts along one axis of the right matrix, where M is a positive integer greater than or equal to 2. The segmented first tensor includes the at least two first sub-tensors.
  • The data cut along the left matrix and the data cut along the right matrix are combined: when the tensor is output, concat aggregation needs to be performed on two different axes, so as to obtain the multiple first sub-tensors after cutting; the multiple first sub-tensors spliced together form the first tensor. The segmentation of the operator is thus realized by the above method.
  • Optionally, the dividing the first tensor into at least two first sub-tensors includes: dividing the first tensor along one axis of the first tensor; and performing tensor reduce aggregation on the divided first tensor to obtain a segmented first tensor, where the segmented first tensor includes the at least two first sub-tensors.
  • The first tensors divided along the same axis are aggregated by reduce to realize the division of the operator, and the resulting divided tensor includes the at least two first sub-tensors.
  • Optionally, before the dividing the first tensor into at least two first sub-tensors, the method further includes: obtaining all segmentation methods of the first tensor; obtaining, according to a mapping relationship, all segmentation-aggregation flows in the depth direction of the computation graph, where the mapping relationship is the relationship between the slices of the segmented first tensor, and each different segmentation method corresponds to one segmentation-aggregation flow; and determining, as the target segmentation method, the segmentation method whose segmentation-aggregation flow is transmitted the farthest along the mapping relationship. The dividing the first tensor into at least two first sub-tensors then includes: dividing the first tensor into the at least two first sub-tensors using the target segmentation method.
  • The segmentation-aggregation flows corresponding to the segmentation methods have different transmission distances in the depth direction of the computation graph, and the segmentation method corresponding to the flow with the farthest transmission distance is determined as the target segmentation method. This segmentation method lets the operators of the first tensor propagate farther in the depth direction after segmentation, thereby minimizing the amount of intermediate data to be processed and obtaining the greatest cache benefit.
  • Optionally, before the dividing the first tensor into at least two first sub-tensors, the method further includes: acquiring a training tensor, where the training tensor and the first tensor are different tensors; determining a plurality of different calculation strategies according to the training tensor, where a calculation strategy includes the number of divisions and the calculation order of the training tensor; and training a target strategy model according to the different calculation strategies, where the target strategy model includes the feedback times of the same training tensor under the different calculation strategies.
  • The target strategy model is obtained from the training tensor by means of reinforcement learning and records the feedback time of the same training tensor under each calculation strategy, so that when a new tensor is obtained, the best calculation strategy can be determined according to the target strategy model.
  • Optionally, the method further includes: using a target training tensor as a fixed input of the target strategy model in each round of multiple iterations, where the target training tensor and the first tensor are different tensors; and tuning the target strategy model according to the output results of the target strategy model over the multiple rounds of iterations.
  • In this way, the strategy model can be obtained quickly from the training set, so the relatively long process of the evolutionary algorithm is not required and the calculation efficiency is improved.
  • Optionally, generating a target strategy model according to the different calculation strategies includes: encoding the different calculation strategies as gene sequences; treating each calculation strategy as an individual and performing iterative verification on the gene sequence of each individual; and taking the optimal solution to which the iterations converge as the calculation strategy in the target strategy model.
  • Each calculation strategy is verified by sampling on the board: each generation produces new individuals (calculation strategies) that are verified on the board or input into the simulator, and the direction of population evolution is adjusted according to the feedback results. The iteration continues until it converges to the optimal solution.
  • Optionally, the encoding of the different calculation strategies as gene sequences includes: encoding the calculation strategies in the tuned target strategy model as the gene sequences.
  • The strategy model obtained by the reinforcement learning algorithm is used as the initial gene sequence of the evolutionary algorithm, so the initial point of the evolutionary algorithm is already a good strategy model. This raises the performance lower bound of the evolutionary algorithm: the performance of the first strategy model generated by the evolutionary algorithm in subsequent work will not be lower than that of the initial strategy model.
  • Because the initial sample of this iterative method is the strategy model computed by the reinforcement learning algorithm, the initial point is a better choice, so the evolutionary algorithm can reach the optimal solution faster and better, and the computational efficiency is greatly improved.
  • Optionally, the method further includes: inputting the target strategy model into a simulator and obtaining a feedback result output by the simulator after it performs data simulation on the target strategy model, where the feedback result represents the performance of the target strategy model; or inputting the target strategy model into a performance predictor and obtaining a prediction result output by the performance predictor, where the prediction result predicts the performance of the target strategy model.
  • The simulator or the performance predictor can predict the feedback time according to a preset method; prediction by the performance predictor replaces actual on-board verification and feeds back performance data, thereby improving the calculation efficiency.
  • Optionally, the method further includes: adding the target strategy model to a strategy knowledge base. Determining the target calculation order of the at least two first sub-tensors includes: acquiring the target strategy model from the strategy knowledge base; and obtaining a target calculation strategy according to the target strategy model, where the target calculation strategy includes the number of divisions of the first tensor and the target calculation order.
  • The strategy search algorithm stores the obtained models in the strategy knowledge base, so that in subsequent work, whenever a new tensor is input, the corresponding calculation strategy can be queried directly from the strategy knowledge base to determine the number of slices of the current tensor. The evolutionary algorithm does not need to be re-executed, which shortens the time for determining the number of slices for each tensor and improves the computational efficiency.
  • an embodiment of the present application provides a data processing apparatus, including:
  • a segmentation unit, used to segment the first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed;
  • an execution unit configured to determine the target calculation order of the at least two first sub-tensors divided by the dividing unit, and the target calculation order is the sequential processing order of the at least two first sub-tensors;
  • a processing unit configured to process the at least two first sub-tensors according to the target computation order determined by the execution unit.
  • the segmentation unit is also used for:
  • N is a positive integer greater than or equal to 2;
  • the segmented first tensor includes the at least two first sub-tensors.
  • the segmentation unit is also used for:
  • the device also includes a determining unit for:
  • the mapping relationship is the relationship between the slices of the segmented first tensor, and each different segmentation method corresponds to a separate segmentation-aggregation flow;
  • the segmentation unit is also used for:
  • the first tensor is divided into the at least two first sub-tensors by the target segmentation method.
  • the device also includes a training unit for:
  • a target strategy model is generated, and the target strategy model includes the feedback time of the same training tensor under different computing strategies.
  • the training unit is also used to:
  • the target training tensor is used as a fixed input to the target policy model, and the target training tensor and the first tensor are different tensors;
  • the target policy model is tuned according to the output result of the target policy model in the multiple rounds of iterations.
  • the training unit is also used to:
  • the optimal solution to which the iterations converge is taken as the calculation strategy in the target strategy model.
  • the training unit is also used to:
  • the calculation strategies in the tuned target strategy model are encoded as the gene sequences.
  • the training unit is also used to:
  • the training unit is also used to:
  • the target calculation strategy is obtained according to the target strategy model, and the target calculation strategy includes the number of divisions of the first tensor and the target calculation order.
  • An embodiment of the present application provides a computer device, including a processor and a memory, where, when running the computer instructions stored in the memory, the processor executes the method described in the first aspect or any optional implementation of the first aspect.
  • An embodiment of the present application provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to execute the method described in the first aspect or any optional implementation of the first aspect.
  • FIG. 1 is an architecture diagram of the cache running on the SoC.
  • FIG. 2 is a schematic diagram of a graph fusion segmentation technology.
  • FIG. 3 is a schematic diagram of a data processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an implementation of the first tensor in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a segmentation method for the first tensor in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another segmentation method for the first tensor in an embodiment of the present application.
  • FIG. 7 is a diagram of the mapping relationship after operator segmentation in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a segmentation-aggregation flow in the depth direction of a computation graph in an embodiment of the present application.
  • FIG. 9 is an architecture diagram of an evolutionary algorithm in an embodiment of the present application.
  • FIG. 10 is an architecture diagram of a reinforcement learning tuning algorithm in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a reinforcement learning network in an embodiment of the present application.
  • FIG. 12 is an architecture diagram of an algorithm combining an evolutionary algorithm and reinforcement learning in an embodiment of the present application.
  • FIG. 13a is a schematic diagram of different calculation orders in an embodiment of the present application.
  • FIG. 13b is a diagram of the correspondence between breadth-first order and pipeline subgraph boundaries in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a gene encoding form of a calculation order in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the whole process of a data processing method provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a data processing apparatus provided by an embodiment of the present application.
  • The amount of data processed by operators in a neural network is large: the data throughput of a single task often reaches tens of MB or even GB, while the SoC Cache and Buffer capacities are limited, which often causes cache misses or buffer overflow, so the data cannot be cached.
  • Fig. 2 shows a schematic diagram of a common graph fusion segmentation technique.
  • each square in Fig. 2 represents a segmented subgraph (group).
  • the depth direction is divided into subgraphs (groups).
  • The upper layer of each square represents the output layer, the lower layer represents the input layer, and the middle is the subgraph body. The technique then decides which input and output layers to keep in the buffer and which to drop.
  • However, this technology only performs grouping in the depth direction of the computation graph and then decides, according to the amount of data (tensor shape) computed by each node, whether to keep it in the cache; it does not perform segmentation at the tensor level. When a single tensor exceeds the cache capacity it cannot be cached, and the method fails.
  • This method also controls the execution order of branch structures to a certain extent, but since there are no graph operations such as tensor segmentation and node replication, the branch structures remain relatively simple, and it is not complete calculation-order control.
  • In related technology, the Cache-Aware Kernel Tiling method is also used to segment data.
  • This method is a graph segmentation optimization technology based on GPU L2 Cache, which reduces cache misses by segmenting operator tensor data.
  • However, this method is optimized only for the L2 Cache of the GPU platform, the number of slices within a fusion depth must be the same, and fusion of different slice counts is not supported.
  • An embodiment of the present application provides a data processing method: by dividing tensors, the cache data reuse rate is improved, the computation data throughput is increased, and the data exchange frequency between the cache and external storage is reduced.
  • the data processing method provided by the embodiment of the present application includes the following steps.
  • the first tensor is a multi-dimensional tensor to be processed.
  • FIG. 4 is a schematic diagram of a possible implementation of the first tensor.
  • The first tensor may include three Matmul operators, namely the first operator 401, the second operator 402, and the third operator 403.
  • The slices obtained by dividing each operator separately are the above-mentioned first sub-tensors.
  • the first operator 401 is divided into two parts Matmul0 to Matmul1
  • the second operator 402 is divided into four parts Matmul0 to Matmul3
  • the third operator 403 is divided into four parts Matmul0 to Matmul3.
  • After segmentation, the amount of first sub-tensor data that the processor engine needs to process each time becomes smaller, so the intermediate data also becomes correspondingly smaller, avoiding buffer overflow; in subsequent work, the cache data reuse rate can be improved, the computation data throughput increased, and the data exchange frequency between the cache and external storage reduced.
  • Specifically, the first tensor can be segmented in the following two ways: 1. tensor splicing (concat) segmentation; and 2. tensor reduction (reduce) segmentation. For ease of understanding, the two segmentation methods are described in detail below.
  • segmentation of the first tensor through concat segmentation may be specifically implemented in the following manner.
  • the first tensor is a matrix multiplication matmul operator.
  • N is a positive integer greater than or equal to 2
  • The first tensor is segmented along one axis of the left matrix, thereby realizing the tensor segmentation step.
  • The first tensor 501 can be divided into two parts along the M axis of the left matrix.
  • The first tensor is input into the right matrix, thereby obtaining the first tensor in the right matrix.
  • M is a positive integer greater than or equal to 2
  • The first tensor is segmented along one axis of the right matrix, thereby realizing the tensor segmentation step.
  • The first tensor 502 can be divided into three parts along the N axis of the right matrix.
  • the segmented first tensor includes at least two first sub-tensors.
  • The data cut along the left matrix (2 parts) and the data cut along the right matrix (3 parts) are combined: when the tensor is output, concat aggregation is performed on the 0 axis and the 1 axis, obtaining 6 first sub-tensors after cutting, and the 6 first sub-tensors spliced together form the first tensor 503.
  • the concat segmentation method of the Matmul operator is realized.
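  • As a rough illustration of this concat segmentation, the following NumPy sketch (the function name and the 2×3 split factors are assumptions chosen to mirror FIG. 5, not code from the patent) cuts the left matrix into 2 parts along the M axis and the right matrix into 3 parts along the N axis, computes the 6 first sub-tensors, and concat-aggregates them on both output axes:

```python
# Minimal sketch of concat segmentation for a matmul operator (illustrative).
import numpy as np

def matmul_concat_split(left, right, m_splits=2, n_splits=3):
    """Cut the left matrix along M and the right matrix along N, then
    restore the full output by concat aggregation on axes 0 and 1."""
    row_blocks = np.array_split(left, m_splits, axis=0)   # 2 parts along M
    col_blocks = np.array_split(right, n_splits, axis=1)  # 3 parts along N
    # Each (row, col) pair yields one first sub-tensor: 2 x 3 = 6 slices.
    out_rows = [np.concatenate([r @ c for c in col_blocks], axis=1)
                for r in row_blocks]
    return np.concatenate(out_rows, axis=0)

left = np.random.rand(6, 4)
right = np.random.rand(4, 9)
assert np.allclose(matmul_concat_split(left, right), left @ right)
```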
  • the segmentation of the first tensor through reduce segmentation may be specifically implemented in the following manner.
  • The first tensor 601 is sliced along one axis of the first tensor, for example along the K axis, and the calculation result of each slice is only a partial result.
  • The segmented first tensor includes at least two first sub-tensors. As shown in FIG. 6, the first tensors 601 sliced along the K axis are aggregated by reduce to realize the reduce segmentation method of the Matmul operator, and the resulting segmented tensor 602 includes the above at least two first sub-tensors.
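  • As a rough illustration of reduce segmentation (again a hedged sketch: the function name and split factor are assumptions, not the patent's code), both matmul inputs are cut along the shared K axis, each slice yields only a partial result, and reduce (sum) aggregation restores the output:

```python
# Minimal sketch of reduce segmentation for a matmul operator (illustrative).
import numpy as np

def matmul_reduce_split(left, right, k_splits=2):
    """Cut both inputs along the K axis; each slice gives only a partial
    result, and summing (reduce aggregation) restores the full output."""
    left_parts = np.array_split(left, k_splits, axis=1)    # K axis of left
    right_parts = np.array_split(right, k_splits, axis=0)  # K axis of right
    partials = [l @ r for l, r in zip(left_parts, right_parts)]
    return np.sum(partials, axis=0)  # reduce aggregation over the K slices

left = np.random.rand(5, 8)
right = np.random.rand(8, 3)
assert np.allclose(matmul_reduce_split(left, right), left @ right)
```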
  • Both the divisible axes of the input tensor and the aggregation axes of the output tensor may have a relatively complex mapping relationship.
  • Operator tensor segmentation is therefore abstracted into the two broad categories, concat and reduce, and the segmentation-aggregation mapping relationship between the inputs and outputs of the corresponding operator is carried in the operator information.
  • FIG. 7 is a diagram of the mapping relationship after operator segmentation. As shown in FIG. 7, for concat segmentation, the input tensor is divided along a certain axis, and the calculation result can be restored by concat-aggregating the corresponding axis of the output tensor.
  • the first tensor may include multiple operators.
  • the first tensor includes three operators, and different segmentation methods may be used to segment each operator. Therefore, all the segmentation methods of the first tensor include all the permutations and combinations of different segmentation methods for each operator.
  • the mapping relationship is the relationship between the slices of the segmented first tensor, and each different segmentation method corresponds to a segmented aggregation flow.
  • FIG. 8 shows a segmentation-aggregation flow in the depth direction of a computation graph, recording the mapping relationships among operator A 801, operator B 802 and operator C 803 of the segmented first tensor. As shown in FIG. 8, the first and third slices input to operator A 801 have a mapping relationship with the first and third slices it outputs, and the first and third slices output by operator A 801 in turn have a mapping relationship with the first and third slices of operator B 802; at this point, the segmentation-aggregation flow of the slice is extended, through the mapping relationship, by the distance of one operator in the depth direction.
  • The segmentation-aggregation flows corresponding to the segmentation methods have different transfer distances in the depth direction of the computation graph, and the segmentation method corresponding to the flow with the farthest transfer distance is determined as the target segmentation method.
  • This target segmentation method lets the operators of the first tensor transfer farther in the depth direction after segmentation, thereby minimizing the amount of intermediate data to be processed and obtaining the maximum cache benefit.
  • The first tensor is divided into at least two first sub-tensors according to the target segmentation method determined in the above manner, realizing the segmentation of the first tensor; the resulting sub-tensors propagate farther in the depth direction after the operators are split.
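  • The selection can be pictured with the following sketch. The operator chain and its per-operator axis maps are invented for illustration (the patent only states that each operator carries a segmentation-aggregation mapping in its operator information); the split whose flow propagates through the most operators is chosen as the target:

```python
# Hedged sketch: pick the split axis whose segmentation-aggregation flow
# travels farthest in the depth direction of the computation graph.
def transfer_distance(op_chain, split_axis):
    """Count how many consecutive operators the split propagates through."""
    distance, axis = 0, split_axis
    for op in op_chain:
        if axis not in op["axis_map"]:  # mapping breaks: the flow stops here
            break
        axis = op["axis_map"][axis]     # slice maps onto the next operator
        distance += 1
    return distance

# Hypothetical chain of three operators with invented axis mappings.
chain = [{"axis_map": {0: 0}}, {"axis_map": {0: 1}}, {"axis_map": {0: 0}}]
candidates = [0, 1]                     # candidate split axes of the first tensor
target = max(candidates, key=lambda ax: transfer_distance(chain, ax))
print(target, transfer_distance(chain, target))  # axis 0 propagates 2 operators deep
```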
  • The first tensor is segmented through the two methods of concat segmentation and reduce segmentation, providing more diverse segmentation options for multi-dimensional tensors.
  • In the above steps, the transfer distance in the depth direction of the graph is calculated and the optimal segmentation method of the first tensor is determined. How many first sub-tensors the first tensor is divided into under that segmentation method is described in detail below.
  • The cache utilization after splitting is related to many factors, such as the network structure (which affects the amount of computation data in the life cycle), the amount of operator parameters, and so on.
  • the training tensor and the first tensor are different tensors.
  • the training tensors are some common types of tensors that need to be processed so that the trained model can adapt to work requirements.
  • The calculation strategy includes the number of divisions of the training tensor. Specifically, each operator in the first tensor can be divided into different numbers of parts according to the segmentation method determined in the foregoing steps, giving different calculation strategies.
  • The strategy model includes the feedback times of the same training tensor under different calculation strategies. Therefore, the trained strategy model can predict the feedback time corresponding to tensors with different division counts, so as to determine the most suitable division count for each tensor.
  • the optimal number of divisions of the first tensor is determined by the method of machine learning.
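  • Conceptually, querying such a model is a simple argmin over predicted feedback times; the sketch below assumes a hypothetical `predict_feedback_time` callable standing in for the learned model:

```python
# Minimal sketch: choose the division count with the smallest predicted
# feedback time. `predict_feedback_time` is a stand-in for the trained model.
def best_split_count(tensor_signature, candidates, predict_feedback_time):
    return min(candidates, key=lambda n: predict_feedback_time(tensor_signature, n))

# e.g. best_split_count(("matmul", (1024, 1024)), [2, 4, 8, 16], model_predict)
```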
  • The machine learning method can be implemented in different ways: 1. an evolutionary algorithm; 2. a reinforcement learning algorithm; 3. reinforcement learning tuning; and 4. an algorithm combining the evolutionary algorithm and reinforcement learning. The four implementations are described in detail below.
  • the implementation of the evolutionary algorithm specifically includes the following steps.
  • different computing strategies correspond to different division numbers, and these different computing strategies are encoded as gene sequences, so that these training tensors become the input data of the evolutionary algorithm.
  • Each calculation strategy is verified by sampling on the board: each generation produces new individuals (calculation strategies) that are verified on the board or input into the simulator, and the direction of population evolution is adjusted according to the feedback results. The iteration continues until it converges to the optimal solution.
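  • The following is a compact sketch of that loop. The encoding (a list of per-operator split counts), the mutation scheme, and the toy evaluator are assumptions for illustration; in the patent the feedback time comes from on-board verification or the simulator:

```python
# Hedged sketch of the evolutionary search over split-count gene sequences.
import random

def evolve(population, feedback_time, generations=50, mutation_rate=0.2):
    """population: list of gene sequences, each a list of split counts."""
    for _ in range(generations):
        scored = sorted(population, key=feedback_time)  # verify individuals
        survivors = scored[: len(scored) // 2]          # selection
        children = []
        for parent in survivors:
            child = parent[:]
            for i in range(len(child)):                 # mutation
                if random.random() < mutation_rate:
                    child[i] = random.choice([1, 2, 4, 8])
            children.append(child)
        population = survivors + children               # next generation
    return min(population, key=feedback_time)           # converged optimum

# Toy evaluator that pretends 4-way splits are ideal for every operator.
seed = [[random.choice([1, 2, 4, 8]) for _ in range(3)] for _ in range(8)]
print(evolve(seed, lambda g: sum(abs(n - 4) for n in g)))  # tends to [4, 4, 4]
```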
  • FIG. 9 shows an architecture diagram of an evolutionary algorithm.
  • The architecture includes a policy search algorithm 901, model compilation 902, operator computation tuning 903, and on-board verification/simulator 904.
  • The individuals of each generation are obtained through model compilation 902 and are then fine-tuned through operator computation tuning 903 to achieve individual variation.
  • The mutated individuals are input into on-board verification/simulator 904, the feedback times of these individuals are verified, and the verification results are fed back to the policy search algorithm 901, so that the policy search algorithm knows the feedback times of the individuals in the current iteration cycle.
  • The optimal solution obtained by the evolutionary algorithm is the most ideal calculation strategy; the first strategy model, i.e. the strategy model obtained by the evolutionary algorithm, is thereby generated.
  • the strategy search algorithm 901 is also connected with a strategy knowledge base 905 and a performance predictor 906.
  • The strategy search algorithm 901 can store these models in the strategy knowledge base 905, so that in subsequent work, whenever a new tensor is input, the corresponding calculation strategy can be queried directly from the strategy knowledge base 905 to determine the number of splits of the current tensor; the evolutionary algorithm does not need to be re-executed, which shortens the time for determining the number of slices for each tensor and improves the computational efficiency.
  • The performance predictor 906 can predict the feedback time according to a preset method; prediction by the performance predictor replaces actual on-board verification and feeds back the performance data, thereby improving the calculation efficiency.
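  • A minimal sketch of the strategy knowledge base lookup described above (the tensor-signature key and function names are illustrative assumptions):

```python
# Hedged sketch of the strategy knowledge base: cache search results so that
# later tensors with the same signature skip the evolutionary search.
strategy_knowledge_base = {}

def lookup_or_search(tensor_signature, run_search):
    if tensor_signature in strategy_knowledge_base:       # direct query
        return strategy_knowledge_base[tensor_signature]
    strategy = run_search(tensor_signature)               # fall back to search
    strategy_knowledge_base[tensor_signature] = strategy  # store for reuse
    return strategy
```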
  • the implementation of the reinforcement learning algorithm specifically includes the following steps.
  • The generation method of the calculation strategies is the same as above; different calculation strategies correspond to different division counts.
  • reinforcement learning acquires the target policy model based on the training set.
  • the reinforcement learning tuning algorithm uses the target training tensor as a fixed input for the input target policy model in each round of multiple iterations, and the target training tensor and the first tensor are different tensors.
  • the target policy model is tuned according to the output results of the target policy model in multiple rounds of iterations.
  • FIG. 10 shows an architecture diagram of reinforcement learning tuning.
  • The architecture includes a reinforcement learning strategy search algorithm 1001, model compilation 1002, operator computation tuning 1003, and on-board verification/simulator 1004.
  • The strategy search algorithm 1001 trains on the training set; the resulting strategy model is compiled by model compilation 1002 and sent to operator computation tuning 1003 to adjust the number of slices, and the data is then verified on the board by on-board verification/simulator 1004 to learn the feedback results of the current strategy model. Finally, the tuned target strategy model is obtained.
  • the architecture also includes a performance predictor 1005, which can predict the feedback time of the computing strategy, so as to determine the computing strategy more quickly.
  • The strategy search algorithm 1001 of the above reinforcement learning may be a hybrid neural network model that combines a GraphSage graph neural network with sequential units such as a long short-term memory (LSTM) network, that is, the policy network of the strategy generation model.
  • the number of slices for each node in the computation graph can be generated in turn.
  • FIG. 11 shows the schematic diagram of the reinforcement learning network.
  • The training set 1101 is input into the GraphSage graph neural network 1102, which performs feature-vector embedding on the training set 1101; the embeddings then pass through fully connected (FC) layers and the LSTM.
  • The policy network shown in FIG. 11 is only an example.
  • The network structure of the strategy generation model (the policy network) is not limited to the above FC+LSTM structure; any suitable deep generative model can be used, such as a Transformer network, which is not limited in this embodiment of the present application.
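  • As a loose PyTorch sketch of such a hybrid policy network (all layer sizes, the mean-aggregation step, and the four split choices are assumptions, not the patent's architecture), a GraphSage-style aggregation embeds each node, an LSTM consumes the nodes in topological order, and an FC head emits split-count logits per node:

```python
# Hedged sketch of a GraphSage + LSTM + FC policy network (illustrative only).
import torch
import torch.nn as nn

class SplitPolicyNet(nn.Module):
    def __init__(self, feat_dim=16, hidden=32, num_split_choices=4):
        super().__init__()
        self.embed = nn.Linear(2 * feat_dim, hidden)  # self + neighbor mean
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_split_choices)

    def forward(self, node_feats, adj):
        # GraphSage-style step: concatenate each node with its neighbor mean.
        neigh = adj @ node_feats / adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.embed(torch.cat([node_feats, neigh], dim=-1)))
        seq, _ = self.lstm(h.unsqueeze(0))  # nodes fed in topological order
        return self.head(seq.squeeze(0))    # one split decision per node

net = SplitPolicyNet()
feats = torch.randn(5, 16)                 # 5 computation-graph nodes
adj = torch.eye(5)                         # toy adjacency matrix
print(net(feats, adj).argmax(dim=-1))      # a split choice for each node
```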
  • the strategy model can be quickly obtained based on the training set, so that the relatively long process of the evolutionary algorithm is not required, and the calculation efficiency is improved.
  • If the strategy model is obtained by means of the evolutionary algorithm, a model with higher accuracy can be obtained, but the speed is slower.
  • If the reinforcement learning method is chosen, the policy model can be obtained faster because multiple rounds of evolutionary-algorithm iteration are not needed, but the accuracy of the model is relatively lower.
  • Since the evolutionary algorithm is sensitive to the selection of the initial point, the above two methods can be combined in order to converge to the optimal solution faster and better.
  • Specifically, the strategy generation model trained by reinforcement learning can be used to generate the initial population, and the evolutionary algorithm then performs crossover and mutation optimization on this population. This improves search efficiency and guarantees a performance lower bound for the optimal solution, ensuring that even the worst-case performance of the strategy model reaches that of the reinforcement learning algorithm's strategy model.
  • The following describes the algorithm combining the evolutionary algorithm and reinforcement learning in detail.
  • the implementation of the algorithm combining evolutionary algorithm and reinforcement learning specifically includes the following steps.
  • The second strategy model is the strategy model obtained through reinforcement learning tuning. The second strategy model obtained by the reinforcement learning algorithm is used as the initial gene sequence of the evolutionary algorithm, so the initial point of the evolutionary algorithm is already a good strategy model, which raises the performance lower bound of the evolutionary algorithm: the performance of the first strategy model generated by the evolutionary algorithm in subsequent work will not be lower than that of the second strategy model.
  • The iteration of the evolutionary algorithm is the same as described above; the difference is that the initial samples are taken from the second strategy model computed by the reinforcement learning algorithm, so the initial point is a better choice, the evolutionary algorithm finds the optimal solution faster and better, and the computational efficiency is greatly improved.
  • Compared with using the evolutionary algorithm or the reinforcement learning algorithm alone, this scheme combines the two, making the sampling of the evolutionary algorithm's initial point better, thereby improving the calculation efficiency while guaranteeing the performance lower bound of the evolutionary algorithm.
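  • A minimal sketch of the seeding step (the `policy_model.sample` interface is an assumed stand-in for sampling gene sequences from the trained strategy generation model):

```python
# Hedged sketch: initialize the GA population from the RL policy model
# instead of at random, so the search starts at the RL model's quality level.
def init_population(policy_model, graph, population_size=8):
    # Assumed interface: policy_model.sample(graph) returns one gene
    # sequence (a per-node split/order encoding) from the learned policy.
    return [policy_model.sample(graph) for _ in range(population_size)]
```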
  • FIG. 12 shows an algorithm architecture diagram of the combination of evolutionary algorithm and reinforcement learning.
  • The architecture includes a strategy generation model 1201, an evolutionary search algorithm 1202, an initialization sample generation algorithm 1203, computation graph compilation 1204, operator computation tuning 1205, on-board verification 1206, and an automated search tool 1209, as well as a strategy knowledge base 1207 and a performance predictor 1208.
  • the strategy generation model 1201 generates a second strategy model through the reinforcement learning algorithm, and sends it to the evolutionary search algorithm 1202 to execute the calculation process of the evolutionary algorithm.
  • the evolutionary search algorithm 1202 sends the second strategy model to the initialization sample generation algorithm 1203 for processing.
  • graph compilation is performed by computational graph compilation 1204
  • operator tuning and on-board verification are performed by operator computation tuning 1205 and on-board verification 1206 respectively, thereby realizing the iterative steps of the evolutionary algorithm.
  • the automated search tool 1209 is responsible for the transfer of data.
  • the evolutionary search algorithm 1202 is also connected with a strategy knowledge base 1207 and a performance predictor 1208.
  • The evolutionary search algorithm 1202 can store these models in the strategy knowledge base 1207, so that in subsequent work, whenever a new tensor is input, the corresponding calculation strategy can be queried directly from the strategy knowledge base 1207 to determine the number of divisions of the current tensor; the evolutionary algorithm does not need to be re-executed, which shortens the time for determining the number of slices for each tensor and improves the computational efficiency.
  • The performance predictor 1208 replaces actual on-board verification by prediction and feeds back performance data through prediction, thereby improving the computing efficiency.
  • The segmentation method and the number of divisions of the first tensor are determined in the above manner, realizing the segmentation operation and dividing the first tensor into at least two first sub-tensors. The next steps can then proceed.
  • the target calculation order is the sequential processing order of the at least two first sub-tensors.
  • the first tensor includes three operators, and the division method and the number of divisions are determined for each operator according to the above method.
  • For example, the first operator 401 is divided into two parts.
  • the second operator 402 is divided into 4 parts, and the third operator 403 is divided into 4 parts.
  • FIG. 13a is a schematic diagram of different calculation sequences.
  • The depth-first calculation order is: after calculating the first slice 13011 (sub-tensor) of the first operator, calculate the first slice 13012 of the next operator; after the first slices of all operators have been executed, start executing the second slices of all operators in sequence, until all slices of all operators have been executed.
  • As shown in 1302 of FIG. 13a, the breadth-first calculation order is to calculate the first slice 13021 through the last slice 13022 of the first operator in turn, and then execute the first through last slices of the second operator, until all slices of all operators are executed.
  • The above calculation order also indirectly defines the subgraph boundaries of the multithreaded concurrent pipeline (the pipeline subgraphs).
  • FIG. 13b shows the corresponding relationship between breadth-first and the boundary of the pipeline sub-graph.
  • the node 1303 of breadth-first can be regarded as the boundary node of the pipeline sub-graph 1304.
  • the depth-first order can better reuse cached data and obtain better memory benefits.
  • However, the depth-first order involves switching between operators, which results in more parameter loading (such as the weight and bias of conv2d), causing data in the cache to be swapped out, or data that cannot be kept in the buffer to be dropped to external storage.
  • In such cases the breadth-first order is better. Therefore, under the premise of satisfying the calculation dependencies, determining the calculation order is also a relatively complex optimization problem that cannot be solved by manual modeling.
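  • The two orders can be made concrete with a small sketch over a chain of operators (the data layout, with `op_slices[i]` listing the slices of operator i, is an illustrative assumption):

```python
# Hedged sketch of the two calculation orders as (operator, slice) schedules.
def depth_first(op_slices):
    n_slices = len(op_slices[0])  # assumes equal slice counts for simplicity
    return [(op, s) for s in range(n_slices) for op in range(len(op_slices))]

def breadth_first(op_slices):
    return [(op, s) for op in range(len(op_slices))
            for s in range(len(op_slices[op]))]

slices = [[0, 1], [0, 1], [0, 1]]  # 3 operators, 2 slices each
print(depth_first(slices))    # (0,0) (1,0) (2,0) (0,1) (1,1) (2,1)
print(breadth_first(slices))  # (0,0) (0,1) (1,0) (1,1) (2,0) (2,1)
```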
  • Therefore, the different calculation orders are also used as encoding factors; specifically, the gene sequence is organized in the form vector<pair<split_num, order>>.
  • the length of the vector represents the number of nodes in the topological order
  • split_num represents the split count of the node;
  • order represents the calculation order, where the depth-first is denoted as D, and the breadth-first is denoted as B.
  • FIG. 14 shows a gene encoding form of the calculation order.
  • the gene encoding of strategy is in the form of [4, B, 4, D, 2, D, 2, B].
  • The tensor shown in FIG. 14 includes four operators, namely the first operator 1401, the second operator 1402, the third operator 1403 and the fourth operator 1404, where:
  • the first operator 1401 is divided into four slices op1_1 to op1_4
  • the second operator 1402 is divided into four slices op2_1 to op2_4
  • the third operator 1403 is divided into two slices op3_1 to op3_2
  • the fourth operator 1404 is divided into two slices op4_1 to op4_2.
  • slices op1_1 to op1_4 adopt a breadth-first calculation order
  • slices op2_1 to op2_4 adopt a breadth-first calculation order
  • slices op3_1 to op3_2 adopt a depth-first order
  • slices op4_1 to op4_2 adopt a depth-first order.
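  • As a hedged sketch (the helper name is illustrative), the gene sequence [4, B, 4, D, 2, D, 2, B] decodes into one (split_num, order) pair per node in topological order; the split counts 4, 4, 2, 2 match the four operators of FIG. 14:

```python
# Hedged sketch: decode a vector<pair<split_num, order>> gene sequence.
def decode_gene(gene):
    pairs = [(gene[i], gene[i + 1]) for i in range(0, len(gene), 2)]
    return [{"split_num": n, "order": "breadth" if o == "B" else "depth"}
            for n, o in pairs]

for node, cfg in enumerate(decode_gene([4, "B", 4, "D", 2, "D", 2, "B"]), 1):
    print(f"operator {node}: {cfg}")
```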
  • The above gene encoding can be used as an individual for iteration; the strategy is optimized by the evolutionary algorithm, and finally a strategy model that can determine the optimal calculation order is obtained.
  • After the strategy model is obtained in the above manner, when a new tensor is input, the tensor is input into the strategy model, and the target calculation order for the tensor can then be obtained from the strategy model.
  • In the above manner, the segmentation method and the division count of the first tensor are determined, so the first tensor is divided into at least two first sub-tensors, and the target calculation order is also determined. The processing of the first tensor is thus realized by processing the at least two divided first sub-tensors in the target calculation order.
  • FIG. 15 provides a schematic diagram of the whole process of an embodiment of the present application.
  • The architecture of the process includes a reinforcement learning (RL) training module 1501, a strategy optimization module 1502, a testing module 1503, and a compilation module 1504 for computation graph compilation.
  • the RL training module 1501 trains the strategy model according to the aforementioned reinforcement learning method, and sends the strategy model to the compilation module 1504 .
  • The strategy optimization module 1502 obtains an optimized strategy model through the aforementioned evolutionary algorithm (GA) search, reinforcement learning optimization, or a combination of the two, and sends the optimized strategy model to the compilation module 1504, which stores it in the knowledge base.
  • When the strategy optimization module 1502 performs reinforcement learning optimization, the initial strategy model it uses is obtained by the compilation module 1504 from the RL training module 1501 and then sent to the strategy optimization module 1502.
  • The obtained strategy model is measured by the testing module 1503 before being sent to the compilation module 1504; the model interacts with the real environment so that the performance of the strategy model can be evaluated.
  • The testing module 1503 can also evaluate the performance of the strategy model through simulator simulation or performance-predictor prediction, thereby improving the efficiency of the testing module 1503.
  • the compilation module 1504 is used for compiling the computation graph.
  • When processing a first tensor, the compilation module 1504 first queries the knowledge base. If an optimal strategy model for the tensor is found, it is used to perform subgraph/operator compilation on the first tensor so that the first tensor is processed optimally. If the compilation module 1504 does not find in the knowledge base an optimal strategy model for the current first tensor, the first tensor is processed through the strategy model generated by the RL training module 1501, ensuring that the first tensor can be processed.
  • In summary, the data processing method provided by the embodiments of the present application includes: dividing a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed; determining a target calculation order of the at least two first sub-tensors, the target calculation order being the sequential processing order of the at least two first sub-tensors; and processing the at least two first sub-tensors according to the target calculation order.
  • an embodiment of the present application provides a computer device. As shown in FIG. 16 , the device includes at least one processor 1601 , a communication line 1602 , a memory 1603 and at least one communication interface 1604 .
  • The processor 1601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the present application.
  • Communication line 1602 may include a path to communicate information between the components described above.
  • Communication interface 1604, using any transceiver-like device, is used for communicating with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • Memory 1603 may be a read-only memory (ROM) or other type of static storage device that can store static information and instructions, or a random access memory (RAM) or other type of dynamic storage device that can store information and instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto.
  • the memory may exist independently and be connected to the processor through communication line 1602 .
  • the memory can also be integrated with the processor.
  • the memory 1603 is used for storing computer-executed instructions for executing the solution of the present application, and the execution is controlled by the processor 1601 .
  • The processor 1601 is configured to execute the computer-executed instructions stored in the memory 1603, so as to implement the data processing method provided by the foregoing embodiments of this application.
  • the computer-executed instructions in this application may also be referred to as application code, which is not specifically limited in this application.
  • the processor 1601 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 16 .
  • the electronic device may include multiple processors, such as the processor 1601 and the processor 1607 in FIG. 16 .
  • processors can be a single-core processor or a multi-core processor.
  • a processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (eg, computer program instructions).
  • the electronic device may further include an output device 1605 and an input device 1606 .
  • the output device 1605 is in communication with the processor 1601 and can display information in a variety of ways.
  • For example, the output device 1605 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • Input device 1606 is in communication with processor 1601 and can receive user input in a variety of ways.
  • the input device 1606 may be a mouse, a keyboard, a touch screen device or a sensing device, or the like.
  • the above-mentioned electronic device may be a general-purpose device or a special-purpose device.
  • Specifically, the electronic device may be the device used for running the data processing method in the embodiments of the present application. This application does not limit the type of the electronic device.
  • the electronic device may be divided into functional units according to the foregoing method examples.
  • each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units. It should be noted that the division of units in the embodiments of the present application is schematic, and is only a logical function division, and other division methods may be used in actual implementation.
  • FIG. 17 shows a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • the data processing apparatus includes:
  • a segmentation unit 1701, configured to segment the first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed;
  • the execution unit 1702 is used to determine the target calculation order of the at least two first sub-tensors divided by the dividing unit 1701, and the target calculation order is the sequential processing order of the at least two first sub-tensors;
  • the processing unit 1703 is configured to process the at least two first sub-tensors according to the target calculation order determined by the execution unit 1702 .
  • segmentation unit 1701 is also used for:
  • N is a positive integer greater than or equal to 2;
  • the segmented first tensor includes the at least two first sub-tensors.
  • segmentation unit 1701 is also used for:
  • the apparatus further includes a determining unit 1704 for:
  • the mapping relationship is the relationship between the slices of the segmented first tensor, and each different segmentation method corresponds to a separate segmentation-aggregation flow;
  • the segmentation unit 1701 is also used for:
  • the first tensor is divided into the at least two first sub-tensors by the target segmentation method.
  • the apparatus further includes a training unit 1705 for:
  • a target strategy model is generated, and the target strategy model includes the feedback time of the same training tensor under different computing strategies.
  • the training unit 1705 is also used for:
  • the target training tensor is used as a fixed input to the target policy model, and the target training tensor and the first tensor are different tensors;
  • the target policy model is tuned according to the output result of the target policy model in the multiple rounds of iterations.
  • the training unit 1705 is also used for:
  • the optimal solution to which the iterations converge is taken as the calculation strategy in the target strategy model.
  • the training unit 1705 is also used for:
  • the calculation strategies in the tuned target strategy model are encoded as the gene sequences.
  • the training unit 1705 is further configured to: input the target policy model into a simulator, and obtain the feedback result output by the simulator after data simulation of the target policy model, where the feedback result indicates the performance of the target policy model; or input the target policy model into a performance predictor, and obtain the prediction result output by the performance predictor, where the prediction result predicts the performance of the target policy model.
  • the training unit 1705 is further configured to: obtain the target policy model from the policy knowledge base; and
  • obtain a target computing strategy according to the target policy model, where the target computing strategy includes the split count of the first tensor and the target computation order.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center integrating one or more available media.
  • the usable media may be magnetic media (for example, floppy disks, hard disks, magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, a solid state disk (SSD)), and the like.
  • words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as preferred or more advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present the related concepts in a specific manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of this application discloses a data processing method, including: splitting a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed; determining a target computation order of the at least two first sub-tensors, where the target computation order is the sequential processing order of the at least two first sub-tensors; and processing the at least two first sub-tensors according to the target computation order. Embodiments of this application further provide an apparatus, a device, and a medium. By splitting the first tensor into at least two first sub-tensors, the amount of first sub-tensor data to be processed at a time becomes smaller, so the intermediate data also becomes correspondingly smaller and cache overflow is avoided; in subsequent operation this improves cache data reuse, increases computation data throughput, and reduces the frequency of data exchange between the cache and external storage.

Description

Data processing method, apparatus, device, and medium
Technical Field
This application relates to the field of artificial intelligence, and in particular to a data processing method, apparatus, device, and medium.
Background
A neural network compiler treats a neural network (NN) model as a computation graph. It first performs graph compilation: it analyzes the graph topology, converts compute nodes into tasks on different compute engines, determines the computation order, and forms the actual computation stream. It then performs operator compilation, generating compute-task code that can run on the accelerator SoC. The whole compilation process involves many graph-optimization and operator-optimization passes, the most important of which is how to use the on-chip cache (Cache & Buffer) efficiently to save data-access overhead, reduce the bandwidth limitation of external storage access, and improve the efficiency of data loading and compute pipelining.
As shown in FIG. 1, FIG. 1 is an architecture diagram of caches running on an SoC. In the Buffer scheme, the memory DDR 101 sends data to the processor Engine 103 for computation, and Engine 103 caches intermediate data in Buffer 102 during computation, thereby buffering the intermediate data. In the Cache scheme, Cache 105 is part of Engine 106; Engine 106 interacts with DDR 104, and Engine 106 decides how intermediate data is cached in Cache 105.
In the prior art, operators in a neural network compute large amounts of data; the data throughput of a single task often reaches tens of MB or even GB, while SoC Cache and Buffer capacity is limited. As a result, cache misses or buffer overflows that prevent caching occur frequently.
Therefore, the above problems in the prior art remain to be solved.
Summary
Embodiments of this application provide a data processing method, apparatus, device, and medium, which are used to reduce the amount of intermediate data in neural network model processing and reduce the frequency of data exchange between the cache and external storage, thereby avoiding cache overflow.
To solve the above technical problems, the embodiments of this application provide the following technical solutions:
According to a first aspect, an embodiment of this application provides a data processing method, including: splitting a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed; determining a target computation order of the at least two first sub-tensors, where the target computation order is the sequential processing order of the at least two first sub-tensors; and processing the at least two first sub-tensors according to the target computation order.
In this embodiment, by splitting the first tensor into at least two first sub-tensors, the amount of first sub-tensor data to be processed each time becomes smaller, so the intermediate data also becomes correspondingly smaller and cache overflow is avoided; in subsequent operation this improves cache data reuse, increases computation data throughput, and reduces the frequency of data exchange between the cache and external storage.
Optionally, splitting the first tensor into at least two first sub-tensors includes: inputting the first tensor into a left-hand matrix; splitting the first tensor into N parts along one axis of the left-hand matrix, where N is a positive integer greater than or equal to 2; inputting the first tensor into a right-hand matrix; splitting the first tensor into M parts along one axis of the right-hand matrix, where M is a positive integer greater than or equal to 2; and performing tensor concatenation-aggregation on the N split parts and the M split parts of the first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
In this embodiment, the split first tensor includes at least two first sub-tensors. The slices cut along the left-hand matrix and the slices cut along the right-hand matrix are aggregated by concatenation along two different axes of the output tensor, yielding the multiple split first sub-tensors; these first sub-tensors concatenated together form the first tensor. In this way, operator splitting is realized.
Optionally, splitting the first tensor into at least two first sub-tensors includes: splitting the first tensor along one axis of the first tensor; and performing tensor reduce-aggregation on the split first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
In this embodiment, the split first tensor includes at least two first sub-tensors. The parts of the first tensor split along the same axis are aggregated, realizing the operator splitting; the resulting split tensor includes the at least two first sub-tensors.
Optionally, before splitting the first tensor into at least two first sub-tensors, the method further includes: obtaining all splitting methods of the first tensor; obtaining, according to a mapping relationship, all split-aggregation flows of the first tensor along the depth direction of the computation graph, where the mapping relationship is the relationship between the slices of the split first tensor, and each different splitting method corresponds to one split-aggregation flow; and determining, as the target splitting method, the splitting method corresponding to the split-aggregation flow whose mapping relationship propagates farthest. Splitting the first tensor into at least two first sub-tensors then includes: splitting the first tensor into the at least two first sub-tensors by the target splitting method.
In this embodiment, the split-aggregation flow corresponding to each splitting method has a different propagation distance along the depth direction of the computation graph, and the splitting method corresponding to the flow that propagates farthest is the target splitting method. Splitting by this method allows the operator slices of the first tensor to propagate farther along the depth direction after splitting, thereby minimizing the amount of intermediate data to be processed and obtaining the greatest cache benefit.
Optionally, before splitting the first tensor into at least two first sub-tensors, the method further includes: obtaining a training tensor, where the training tensor and the first tensor are different tensors; determining multiple different computing strategies according to the training tensor, where a computing strategy includes the split count and computation order of the training tensor; and training and generating a target policy model according to the different computing strategies, where the target policy model includes the feedback time of the same training tensor under different computing strategies.
In this embodiment, the target policy model is obtained from the training tensors by reinforcement learning. The target policy model includes the feedback time of the same training tensor under different computing strategies, so the feedback time of the same training tensor under different computing strategies can be obtained from the target policy model; when a new tensor is obtained, the best computing strategy can thus be determined according to the target policy model.
Optionally, after training and generating the target policy model according to the different computing strategies, the method further includes: in each of multiple rounds of iteration, using a target training tensor as a fixed input to the target policy model, where the target training tensor and the first tensor are different tensors; and tuning the target policy model according to the results output by the target policy model in the multiple rounds of iteration.
In this embodiment, the reinforcement learning method can quickly obtain a policy model from the training set without going through the relatively long process of an evolutionary algorithm, improving computational efficiency.
Optionally, training and generating the target policy model according to the different computing strategies includes: encoding the different computing strategies as gene sequences; treating each computing strategy as an individual and iteratively verifying the gene sequence of each individual; and obtaining the converged optimal solution of the iteration results as the computing strategy in the target policy model.
In this embodiment, each computing strategy, as an individual, is sampled and run on hardware; each generation produces new individuals (computing strategies) that are run on hardware or fed into a simulator for verification, and the feedback results are used to adjust the direction of population evolution. Through continuous iteration, the process finally converges to the optimal solution.
Optionally, encoding the different computing strategies as gene sequences includes: encoding the computing strategies in the tuned target policy model as the gene sequences.
In this embodiment, the policy model obtained by the reinforcement learning algorithm serves as the initial gene sequence of the evolutionary algorithm, so that the starting point of the evolutionary algorithm is already a fairly good policy model. This raises the performance floor of the evolutionary algorithm; that is, the performance of the first policy model generated by the evolutionary algorithm in subsequent operation will not be lower than that of the initial policy model. Because the initial samples of this iterative method are policy models computed by the reinforcement algorithm, the initial point is better, enabling the evolutionary algorithm to reach the optimal solution faster and better and greatly improving computational efficiency.
Optionally, after training and generating the target policy model according to the different computing strategies, the method further includes: inputting the target policy model into a simulator, and obtaining the feedback result output by the simulator after data simulation of the target policy model, where the feedback result indicates the performance of the target policy model; or inputting the target policy model into a performance predictor, and obtaining the prediction result output by the performance predictor, where the prediction result predicts the performance of the target policy model.
In this embodiment, the simulator or performance predictor can predict the feedback time according to a preset method; the predictor's prediction replaces real on-hardware verification, and performance data are fed back by prediction, improving computational efficiency.
Optionally, the method further includes: adding the target policy model to a policy knowledge base. Determining the target computation order of the at least two first sub-tensors then includes: obtaining the target policy model from the policy knowledge base; and obtaining a target computing strategy according to the target policy model, where the target computing strategy includes the split count of the first tensor and the target computation order.
In this embodiment, the policy search algorithm stores the obtained models in the policy knowledge base. In subsequent operation, whenever a new tensor is input, the corresponding computing strategy can be looked up directly in the policy knowledge base to determine the split count of the current tensor, without re-running the evolutionary algorithm. This shortens the time needed to determine the split count of each tensor and improves computational efficiency.
According to a second aspect, an embodiment of this application provides a data processing apparatus, including:
a splitting unit, configured to split a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed;
an execution unit, configured to determine a target computation order of the at least two first sub-tensors split by the splitting unit, where the target computation order is the sequential processing order of the at least two first sub-tensors;
a processing unit, configured to process the at least two first sub-tensors according to the target computation order determined by the execution unit.
Optionally, the splitting unit is further configured to:
input the first tensor into a left-hand matrix;
split the first tensor into N parts along one axis of the left-hand matrix, where N is a positive integer greater than or equal to 2;
input the first tensor into a right-hand matrix;
split the first tensor into M parts along one axis of the right-hand matrix, where M is a positive integer greater than or equal to 2;
perform tensor concatenation-aggregation on the N split parts and the M split parts of the first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
Optionally, the splitting unit is further configured to:
split the first tensor along one axis of the first tensor;
perform tensor reduce-aggregation on the split first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
Optionally, the apparatus further includes a determining unit, configured to:
obtain all splitting methods of the first tensor;
obtain, according to a mapping relationship, all split-aggregation flows of the first tensor along the depth direction of the computation graph, where the mapping relationship is the relationship between the slices of the split first tensor, and each different splitting method corresponds to one split-aggregation flow;
determine, as the target splitting method, the splitting method corresponding to the split-aggregation flow whose mapping relationship propagates farthest.
The splitting unit is further configured to:
split the first tensor into the at least two first sub-tensors by the target splitting method.
Optionally, the apparatus further includes a training unit, configured to:
obtain a training tensor, where the training tensor and the first tensor are different tensors;
determine multiple different computing strategies according to the training tensor, where a computing strategy includes the split count and computation order of the training tensor;
train and generate a target policy model according to the different computing strategies, where the target policy model includes the feedback time of the same training tensor under different computing strategies.
Optionally, the training unit is further configured to:
in each of multiple rounds of iteration, use a target training tensor as a fixed input to the target policy model, where the target training tensor and the first tensor are different tensors;
tune the target policy model according to the results output by the target policy model in the multiple rounds of iteration.
Optionally, the training unit is further configured to:
encode the different computing strategies as gene sequences;
treat each computing strategy as an individual, and iteratively verify the gene sequence of each individual;
obtain the converged optimal solution of the iteration results as the computing strategy in the target policy model.
Optionally, the training unit is further configured to:
encode the computing strategies in the tuned target policy model as the gene sequences.
Optionally, the training unit is further configured to:
input the target policy model into a simulator;
obtain the feedback result output by the simulator after data simulation of the target policy model, where the feedback result indicates the performance of the target policy model; or,
input the target policy model into a performance predictor;
obtain the prediction result output by the performance predictor, where the prediction result predicts the performance of the target policy model.
Optionally, the training unit is further configured to:
obtain the target policy model from the policy knowledge base;
obtain a target computing strategy according to the target policy model, where the target computing strategy includes the split count of the first tensor and the target computation order.
According to a third aspect, an embodiment of this application provides a computer device, including a processor and a memory, where the processor, when running the computer instructions stored in the memory, performs the method described in the first aspect or any optional implementation of the first aspect.
According to a fourth aspect, an embodiment of this application provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the method described in the first aspect or any optional implementation of the first aspect.
Brief Description of the Drawings
FIG. 1 is an architecture diagram of caches running on an SoC;
FIG. 2 is a schematic diagram of a graph-fusion splitting technique;
FIG. 3 is a schematic diagram of the data processing method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of one implementation of the first tensor in an embodiment of this application;
FIG. 5 is a schematic diagram of one splitting method of the first tensor in an embodiment of this application;
FIG. 6 is a schematic diagram of another splitting method of the first tensor in an embodiment of this application;
FIG. 7 is a diagram of the mapping relationship after operator splitting in an embodiment of this application;
FIG. 8 is a schematic diagram of a split-aggregation flow along the depth direction of a computation graph in an embodiment of this application;
FIG. 9 is an architecture diagram of an evolutionary algorithm in an embodiment of this application;
FIG. 10 is an architecture diagram of a reinforcement-learning tuning algorithm in an embodiment of this application;
FIG. 11 is a schematic diagram of the reinforcement learning network in an embodiment of this application;
FIG. 12 is an architecture diagram of an algorithm combining an evolutionary algorithm with reinforcement learning in an embodiment of this application;
FIG. 13a is a schematic diagram of different computation orders in an embodiment of this application;
FIG. 13b is a diagram of the correspondence between breadth-first order and pipeline-subgraph boundaries in an embodiment of this application;
FIG. 14 is a schematic diagram of a memory-encoding form of a computation order in an embodiment of this application;
FIG. 15 is a full-flow schematic diagram of the data processing method provided by an embodiment of this application;
FIG. 16 is a schematic diagram of the computer device provided by an embodiment of this application;
FIG. 17 is a schematic diagram of the data processing apparatus provided by an embodiment of this application.
Detailed Description
A neural network compiler treats a neural network (NN) model as a computation graph. It first performs graph compilation: it analyzes the graph topology, converts compute nodes into tasks on different compute engines, determines the computation order, and forms the actual computation stream. It then performs operator compilation, generating compute-task code that can run on the accelerator SoC. The whole compilation process involves many graph-optimization and operator-optimization passes, the most important of which is how to use the on-chip cache (Cache & Buffer) efficiently to save data-access overhead, reduce the bandwidth limitation of external storage access, and improve the efficiency of data loading and compute pipelining.
Typically, operators in a neural network compute large amounts of data; the data throughput of one task often reaches tens of MB or even GB, while SoC Cache and Buffer capacity is limited. As a result, cache misses or buffer overflows that prevent caching occur frequently.
To solve the above problems, the following two approaches are mainly used at present.
Approach 1.
Referring to FIG. 2, FIG. 2 shows a schematic diagram of a common graph-fusion splitting technique. As shown in FIG. 2, each block in FIG. 2 represents a split subgraph (group); subgraphs are split along the depth direction of the computation graph. The upper layer of each block represents the output layer, the lower layer represents the input layer, and the middle layer is the subgraph. The technique then decides which input and output layers are kept in the buffer (keep) and which are not (drop).
However, this technique only groups along the depth direction of the computation graph and then decides, according to the amount of data computed at each node (the tensor shape), whether to keep it in the cache; it does not split at the tensor level, so once the data volume is too large to cache, the approach fails. In addition, the method also controls the execution order of branch structures to some extent, but because there are no graph operations such as tensor splitting and node duplication, the branch structures are relatively simple and the computation-order control is not thorough.
Approach 2.
Currently, a Cache-Aware Kernel Tiling method is also used to split the data. This method is a graph-splitting optimization technique based on the GPU L2 cache that reduces cache misses by splitting operator tensor data.
However, this method only optimizes the L2 cache of GPU platforms, the split count must be the same across the fusion depth, and fusion with different split counts is not supported.
Therefore, to solve the above problems, an embodiment of this application provides a data processing method that splits tensors to improve cache data reuse, increase computation data throughput, and reduce the frequency of data exchange between the cache and external storage.
Referring to FIG. 3, as shown in FIG. 3, the data processing method provided by the embodiment of this application includes the following steps.
301. Split the first tensor into at least two first sub-tensors.
In this embodiment, the first tensor is a multi-dimensional tensor to be processed. Referring to FIG. 4, FIG. 4 is a schematic diagram of one possible implementation of the first tensor. As shown in FIG. 4, the first tensor may include three Matmul operators, namely a first operator 401, a second operator 402, and a third operator 403. The slices obtained by splitting each operator are the first sub-tensors described above. As can be seen in FIG. 4, the first operator 401 is split into two parts, Matmul0 to Matmul1; the second operator 402 is split into four parts, Matmul0 to Matmul3; and the third operator 403 is split into four parts, Matmul0 to Matmul3.
By splitting the first tensor into at least two first sub-tensors, the amount of first sub-tensor data that the processor Engine needs to process at a time becomes smaller, so the intermediate data also becomes correspondingly smaller and cache overflow is avoided; in subsequent operation this improves cache data reuse, increases computation data throughput, and reduces the frequency of data exchange between the cache and external storage.
Optionally, the first tensor can be split in the following two ways: I. tensor concatenation (concat) splitting, and II. tensor reduction (reduce) splitting. For ease of understanding, the two splitting methods are described in detail below.
I. Tensor concatenation (concat) splitting.
In this embodiment, splitting the first tensor by concat splitting can be implemented as follows.
1. Input the first tensor into the left-hand matrix.
In this embodiment, optionally, the first tensor is a matrix multiplication (matmul) operator.
2. Split the first tensor into N parts along one axis of the left-hand matrix.
In this embodiment, N is a positive integer greater than or equal to 2, and the first tensor is split along one axis of the left-hand matrix, realizing the tensor splitting step. For example, as shown in FIG. 5, the first tensor 501 can be split into two parts along the M axis of the left-hand matrix.
3. Input the first tensor into the right-hand matrix.
In this embodiment, the first tensor is input into the right-hand matrix to obtain the first tensor in the right-hand matrix.
4. Split the first tensor into M parts along one axis of the right-hand matrix.
In this embodiment, M is a positive integer greater than or equal to 2, and the first tensor is split along one axis of the right-hand matrix, realizing the tensor splitting step. For example, as shown in FIG. 5, the first tensor 502 can be split into three parts along the N axis of the right-hand matrix.
5. Perform tensor concatenation-aggregation on the N split parts and the M split parts of the first tensor to obtain the split first tensor.
In this embodiment, the split first tensor includes at least two first sub-tensors. For example, as shown in FIG. 5, the slices cut along the left-hand matrix (2 parts) and the slices cut along the right-hand matrix (3 parts) are aggregated by concat along axes 0 and 1 of the output tensor, yielding the 6 split first sub-tensors; these 6 first sub-tensors concatenated together form the first tensor 503. In this way, the concat splitting of the Matmul operator is realized.
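For illustration, the following is a minimal NumPy sketch (not part of the patent itself) of the concat splitting described above, using the FIG. 5 configuration as an assumption: the left-hand matrix is cut into 2 parts along its M axis, the right-hand matrix into 3 parts along its N axis, and the 2 x 3 = 6 partial matmul results are concatenated back along axes 0 and 1.

```python
import numpy as np

def matmul_concat_split(A, B, n_splits=2, m_splits=3):
    # split the left-hand matrix along its row (M) axis
    a_parts = np.array_split(A, n_splits, axis=0)
    # split the right-hand matrix along its column (N) axis
    b_parts = np.array_split(B, m_splits, axis=1)
    # each (a, b) pair yields one "first sub-tensor" result; concat them
    row_blocks = [
        np.concatenate([a @ b for b in b_parts], axis=1)  # concat along axis 1
        for a in a_parts
    ]
    return np.concatenate(row_blocks, axis=0)             # concat along axis 0

A = np.random.rand(8, 4)
B = np.random.rand(4, 6)
assert np.allclose(matmul_concat_split(A, B), A @ B)  # result is restored
```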
II. Tensor reduction (reduce) splitting.
In this embodiment, splitting the first tensor by reduce splitting can be implemented as follows.
1. Split the first tensor along one axis of the first tensor.
In this embodiment, as shown in FIG. 6, the first tensor 601 is split along one of its axes, for example along the K axis; the computation result of each slice is then only a partial result.
2. Perform tensor reduce-aggregation on the split first tensor to obtain the split first tensor.
In this embodiment, the split first tensor includes at least two first sub-tensors. As shown in FIG. 6, the two parts of the first tensor 601 split along the K axis are aggregated by reduce, realizing the reduce splitting of the Matmul operator; the resulting split tensor 602 includes the at least two first sub-tensors.
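Similarly, a minimal NumPy sketch of reduce splitting along the K axis (shapes assumed for illustration): each slice pair produces only a partial result, and summation serves as the reduce aggregation.

```python
import numpy as np

def matmul_reduce_split(A, B, k_splits=2):
    # cut both inputs along the shared K axis
    a_parts = np.array_split(A, k_splits, axis=1)
    b_parts = np.array_split(B, k_splits, axis=0)
    # each slice pair contributes only a partial result
    partials = [a @ b for a, b in zip(a_parts, b_parts)]
    return np.sum(partials, axis=0)  # reduce aggregation over the partials

A = np.random.rand(8, 4)
B = np.random.rand(4, 6)
assert np.allclose(matmul_reduce_split(A, B), A @ B)  # result is restored
```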
In this embodiment, for an unknown operator, the splittable axes of its input tensor and the aggregation axes of its output tensor may have fairly complex mapping relationships. The embodiments of this application divide operator tensor splitting into the two major types concat and reduce, and carry the split-aggregation mapping relationship between the corresponding operator's input and output in the operator information. Referring to FIG. 7, FIG. 7 is a diagram of the mapping relationship after operator splitting. As shown in FIG. 7, for concat splitting, when some axis of the input tensor is split, the computation result can be restored by concat-aggregating the corresponding axis of the output tensor. For reduce splitting, when some axis of the input tensor is split, the computation result can be restored by reduce-aggregating the corresponding axis of the output tensor, where the reduce function satisfies f(a,b)=f(f(a),f(b)).
It should be noted that a tensor can be split either by the concat method or by the reduce method described above. In actual operation, the most suitable splitting method needs to be determined by the following method, which includes the following steps.
1. Obtain all splitting methods of the first tensor.
In this embodiment, the first tensor may include multiple operators; for example, as shown in FIG. 4, the first tensor includes three operators, and each operator can be split by a different splitting method. Therefore, all splitting methods of the first tensor include all permutations and combinations of applying the different splitting methods to each operator.
2. Obtain, according to the mapping relationship, all split-aggregation flows of the first tensor along the depth direction of the computation graph.
In this embodiment, the mapping relationship is the relationship between the slices of the split first tensor, and each different splitting method corresponds to one split-aggregation flow. For example, FIG. 8 shows a split-aggregation flow along the depth direction of a computation graph, recording the post-split mapping relationships of operator A 801, operator B 802, and operator C 803 of the first tensor. As shown in FIG. 8, the first and third input slices of operator A 801 map to its first and third output slices, and the first and third output slices of operator A 801 in turn map to the first and third slices of operator B 802; the split-aggregation flow of these slices along the depth direction is thus extended by the distance of one operator through the mapping relationship.
3. Determine, as the target splitting method, the splitting method corresponding to the split-aggregation flow whose mapping relationship propagates farthest.
In this embodiment, as shown in FIG. 8, the split-aggregation flow corresponding to each splitting method has a different propagation distance along the depth direction of the computation graph. The splitting method corresponding to the flow that propagates farthest is the target splitting method: it allows the operator slices of the first tensor to propagate farther along the depth direction after splitting, thereby minimizing the amount of intermediate data to be processed and obtaining the greatest cache benefit.
4. Split the first tensor into at least two first sub-tensors by the target splitting method.
In this embodiment, the first tensor is split into at least two first sub-tensors by the target splitting method determined above, realizing the splitting of the first tensor; the operator slices of the resulting sub-tensors propagate farther along the depth direction.
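A hypothetical sketch of selecting the target splitting method by propagation distance follows; the propagates predicate is an assumption standing in for the per-operator split-aggregation mappings of FIG. 7, and the operator names are illustrative rather than from the patent.

```python
from itertools import product

def propagation_distance(chain, choice, propagates):
    # walk the operator chain in depth order and count how many consecutive
    # operators the slice mapping can pass through without re-aggregation
    dist = 0
    for op, method in zip(chain, choice):
        if not propagates(op, method):  # the mapping breaks here
            break
        dist += 1
    return dist

def pick_target_splitting(chain, methods, propagates):
    # enumerate every combination of splitting methods across operators
    return max(product(methods, repeat=len(chain)),
               key=lambda c: propagation_distance(chain, c, propagates))

# toy example: assume these matmul-like ops propagate a concat split only
chain = ["matmul1", "matmul2", "matmul3"]
prop = lambda op, m: m == "concat"
print(pick_target_splitting(chain, ["concat", "reduce"], prop))
# -> ('concat', 'concat', 'concat'), the farthest-propagating choice
```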
In this embodiment, the first tensor is split in two ways, concat splitting and reduce splitting, providing more diverse splitting methods for multi-dimensional tensors; further, the optimal splitting method of the first tensor is determined from the propagation distance of its computation graph along the depth direction. On this basis, once the splitting method has been determined, the split count needs to be determined next, that is, how many first sub-tensors the first tensor should be split into under the determined splitting method. For ease of understanding, this is described in detail below.
It should be noted that the effect of the split count on performance is mainly reflected in the following aspects:
With more splits, each part computes less data and is easier to keep in the cache, improving data reuse.
With more splits, more tasks are produced during computation, incurring a certain task launch cost.
With more splits, some operators incur more data movement; for example, splitting the fmap of a conv2d operator causes the weight data to be moved multiple times.
The cache benefit of a split is related to many factors such as the network structure (which affects the amount of data computed within a lifetime) and the number of operator parameters.
Therefore, too few splits hurt cache data reuse, but more splits are not always better; this is a fairly complex optimization problem. Because judging the split count is complex and cannot be solved by manual modeling, the split count can be determined by machine learning, which includes the following steps.
1. Obtain training tensors.
In this embodiment, the training tensor and the first tensor are different tensors. Optionally, the training tensors are common types of tensors that need to be processed, so that the trained model can meet working requirements.
2. Determine multiple different computing strategies according to the training tensors.
In this embodiment, a computing strategy includes the split count of the training tensor. Specifically, according to the splitting method determined in the preceding steps, each operator in the first tensor can be split into a different number of parts, giving different computing strategies.
3. Train and generate a policy model according to the different computing strategies.
In this embodiment, the policy model includes the feedback time of the same training tensor under different computing strategies. The trained policy model can therefore predict the feedback time corresponding to a tensor under different split counts, so that the most suitable split count for each tensor can be determined.
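As a toy illustration of how such a policy model can be queried (the tensor signature, the strategies, and the feedback times below are made-up assumptions, not measured values):

```python
# the learned model is stood in for by a lookup table:
# (tensor signature, (split count, order)) -> feedback time in seconds
feedback_times = {
    ("matmul_8x4x6", (2, "D")): 1.8,
    ("matmul_8x4x6", (4, "D")): 1.2,
    ("matmul_8x4x6", (4, "B")): 1.5,
    ("matmul_8x4x6", (8, "B")): 2.1,
}

def best_strategy(tensor_sig):
    # pick the strategy with the shortest predicted feedback time
    candidates = {k[1]: t for k, t in feedback_times.items() if k[0] == tensor_sig}
    return min(candidates, key=candidates.get)

print(best_strategy("matmul_8x4x6"))  # -> (4, 'D')
```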
In this embodiment, the optimal split count of the first tensor is determined by machine learning. It should be noted that the above machine learning can be implemented in different ways: I. an evolutionary algorithm; II. a reinforcement learning algorithm; III. reinforcement-learning tuning; and IV. an algorithm combining the evolutionary algorithm with reinforcement learning. For ease of understanding, the four implementations are described in detail below.
I. Evolutionary algorithm.
In this embodiment, the implementation of the evolutionary algorithm includes the following steps.
1. Encode the different computing strategies as gene sequences.
In this embodiment, different computing strategies correspond to different split counts; encoding these different computing strategies as gene sequences makes the training tensors the input data of the evolutionary algorithm.
2. Treat each computing strategy as an individual, and iteratively verify the gene sequence of each individual.
In this embodiment, each computing strategy, as an individual, is sampled and run on hardware; each generation produces new individuals (computing strategies) that are run on hardware or fed into a simulator for verification, and the feedback results are used to adjust the direction of population evolution. Through continuous iteration, the process finally converges to the optimal solution.
Optionally, FIG. 9 shows an architecture diagram of an evolutionary algorithm. As shown in FIG. 9, the architecture includes a policy search algorithm 901, model compilation 902, operator computation tuning 903, and on-hardware verification/simulator 904. During iterative verification, the individuals of each generation are obtained through model compilation 902 and then fine-tuned through operator computation tuning 903, realizing individual mutation; a specific fine-tuning method may be adjusting the split counts of some operators in the first tensor. The mutated individuals are then input into on-hardware verification/simulator 904 to verify their feedback times, and the verification results are fed back to the policy search algorithm 901 so that it knows the feedback times of the individuals in the current round of iteration.
3. Obtain the converged optimal solution of the iteration results as the computing strategy in the first policy model.
In this embodiment, the optimal solution obtained by the evolutionary algorithm is the ideal computing strategy, from which the first policy model is generated; the first policy model is the policy model obtained by the evolutionary algorithm.
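A hypothetical sketch of this evolutionary search over split counts is given below; the feedback_time function is a made-up stand-in for on-hardware verification or the simulator, and the population sizes and candidate counts are illustrative assumptions.

```python
import random

OPS = 3                      # operators in the first tensor
CHOICES = [1, 2, 4, 8]       # candidate split counts per operator

def feedback_time(gene):
    # toy cost model standing in for board/simulator feedback
    return sum(abs(g - 4) for g in gene) + 0.1 * sum(gene)

def evolve(pop_size=16, generations=30, mutation_rate=0.3):
    # each individual's gene is one split count per operator
    pop = [[random.choice(CHOICES) for _ in range(OPS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=feedback_time)            # shorter feedback time is fitter
        parents = pop[: pop_size // 2]
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, OPS)     # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:
                child[random.randrange(OPS)] = random.choice(CHOICES)
            children.append(child)
        pop = parents + children
    return min(pop, key=feedback_time)         # converged optimal strategy

print(evolve())
```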
Optionally, as shown in FIG. 9, the policy search algorithm 901 is also connected to a policy knowledge base 905 and a performance predictor 906. The policy search algorithm 901 can store the policy models obtained by each run of the evolutionary algorithm in the policy knowledge base 905; in subsequent operation, whenever a new tensor is input, the corresponding computing strategy can be looked up directly in the policy knowledge base 905 to determine the split count of the current tensor, without re-running the evolutionary algorithm. This shortens the time needed to determine the split count of each tensor and improves computational efficiency.
Meanwhile, the performance predictor 906 can predict the feedback time according to a preset method; the predictor's prediction replaces real on-hardware verification, and performance data are fed back by prediction, improving computational efficiency.
II. Reinforcement learning algorithm.
In this embodiment, the implementation of the reinforcement learning algorithm includes the following steps.
1. Use the different computing strategies as the training set.
In this embodiment, the computing strategies are generated in the same way as above; different computing strategies correspond to different split counts.
2. Train the training set into the target policy model by reinforcement learning.
In this embodiment, reinforcement learning obtains the target policy model based on the training set.
III. Reinforcement-learning tuning.
In this embodiment, in each of multiple rounds of iteration, the reinforcement-learning tuning algorithm uses a target training tensor as a fixed input to the target policy model, where the target training tensor and the first tensor are different tensors.
The target policy model is then tuned according to the results output by the target policy model in the multiple rounds of iteration.
Optionally, FIG. 10 shows an architecture diagram of reinforcement-learning tuning. As shown in FIG. 10, the architecture includes a reinforcement-learning policy search algorithm 1001, model compilation 1002, operator computation tuning 1003, and on-hardware verification/simulator 1004. The policy search algorithm 1001 trains on the training set; the resulting policy model is compiled by model compilation 1002 and sent to operator computation tuning 1003, which adjusts the split counts; on-hardware verification/simulator 1004 then verifies the data on hardware to learn the feedback results of the current policy model. The tuned target policy model is finally obtained.
Optionally, the architecture also includes a performance predictor 1005, which can predict the feedback time of a computing strategy so that the computing strategy can be determined more quickly.
Optionally, the reinforcement-learning policy search algorithm 1001 may be a hybrid neural network model combining a GraphSage graph neural network with sequence units such as a long short-term memory (LSTM) network; the policy network of the policy generation model can generate the split count of each node in the computation graph in turn. Referring to FIG. 11, FIG. 11 shows a schematic diagram of the reinforcement learning network. As shown in FIG. 11, the training set 1101 is input into the GraphSage graph neural network 1102, which performs the embedding operation 1103 on the training set 1101 to obtain a graph vector 1104 of the training set 1101. The graph vector 1104 is then input into the policy network 1105 of the policy generation model, which processes the graph vector 1104 through fully connected (FC) layers, an LSTM, and an FC layer in turn, finally obtaining the second policy model 1106.
It should be noted that the policy network shown in FIG. 11 is only an example. In actual operation, the network structure of the policy generation model is not limited to the FC+LSTM structure above; any suitable deep generative model, such as a Transformer network, may be used, which is not limited in the embodiments of this application.
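For illustration, a hypothetical PyTorch sketch of such an FC+LSTM policy network is given below; the layer sizes, the four candidate split counts, and the way the graph embedding is expanded into a per-node sequence are all assumptions, since the patent does not fix these details.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, embed_dim=64, hidden=128, n_choices=4):
        super().__init__()
        self.fc_in = nn.Linear(embed_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc_out = nn.Linear(hidden, n_choices)  # logits over split counts

    def forward(self, graph_embedding, n_nodes):
        # repeat the graph embedding once per node to form an input sequence
        x = self.fc_in(graph_embedding).unsqueeze(1).repeat(1, n_nodes, 1)
        h, _ = self.lstm(x)
        return self.fc_out(h)  # shape: (batch, n_nodes, n_choices)

net = PolicyNetwork()
emb = torch.randn(1, 64)               # assumed GraphSage graph vector
logits = net(emb, n_nodes=3)
split_choices = logits.argmax(dim=-1)  # one split-count index per node
print(split_choices)
```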
In this embodiment, the reinforcement learning method can quickly obtain a policy model from the training set without going through the relatively long process of an evolutionary algorithm, improving computational efficiency.
It should be noted that, in actual operation, obtaining the policy model through the evolutionary algorithm yields a more accurate model but is slower; conversely, obtaining the policy model through reinforcement learning does not require the multiple rounds of iteration of the evolutionary algorithm and obtains the policy model faster, but the accuracy of the model is relatively lower.
Further, because the evolutionary algorithm is sensitive to the choice of initial point, the two approaches can be combined to obtain the converged optimal solution faster and better. For example, the policy generation model trained by reinforcement learning can be used to produce the initial population, and the evolutionary algorithm then performs crossover and mutation optimization on that population. This improves search efficiency and guarantees the performance floor of the optimal solution, ensuring that even the worst-case performance of the policy model reaches that of the policy model produced by the reinforcement learning algorithm. For ease of understanding, the algorithm combining the evolutionary algorithm with reinforcement learning is described in detail below.
IV. Algorithm combining the evolutionary algorithm with reinforcement learning.
In this embodiment, the implementation of the combined algorithm includes the following steps.
1. Encode the computing strategies in the tuned target policy model as gene sequences.
In this embodiment, the second policy model is the policy model obtained by reinforcement-learning tuning. Using the second policy model obtained by the reinforcement learning algorithm as the initial gene sequences of the evolutionary algorithm means that the starting point of the evolutionary algorithm is already a fairly good policy model. This raises the performance floor of the evolutionary algorithm; that is, the performance of the first policy model generated by the evolutionary algorithm in subsequent operation will not be lower than that of the second policy model.
2. Treat each computing strategy in the second policy model as an individual, and iteratively verify the gene sequence of each individual.
In this embodiment, the iteration method of the evolutionary algorithm is the same as described above; the difference is that the initial samples of this iteration are the second policy model computed by the reinforcement algorithm, so the initial point is better, enabling the evolutionary algorithm to reach the optimal solution faster and better and greatly improving computational efficiency.
3. Obtain the converged optimal solution of the iteration results as the computing strategy in the first policy model.
In this embodiment, compared with using the evolutionary algorithm alone or the reinforcement learning algorithm alone, this solution combines the two, making the initial sampling point of the evolutionary algorithm better, improving computational efficiency while guaranteeing the performance floor of the evolutionary algorithm.
Further, referring to FIG. 12, FIG. 12 shows an architecture diagram of the algorithm combining the evolutionary algorithm with reinforcement learning. As shown in FIG. 12, the architecture includes a policy generation model 1201, an evolutionary search algorithm 1202, an initialization sample generation algorithm 1203, computation graph compilation 1204, operator computation tuning 1205, on-hardware verification 1206, and an automated search tool 1209, as well as a policy knowledge base 1207 and a performance predictor 1208.
In actual operation, the policy generation model 1201 generates the second policy model through the reinforcement learning algorithm and sends it to the evolutionary search algorithm 1202, which executes the computing flow of the evolutionary algorithm. The evolutionary search algorithm 1202 sends the second policy model to the initialization sample generation algorithm 1203 for initialization, after which computation graph compilation 1204 performs graph compilation, and operator computation tuning 1205 and on-hardware verification 1206 respectively execute operator tuning and on-hardware verification, realizing the iteration steps of the evolutionary algorithm. Throughout this process, the automated search tool 1209 is responsible for data transfer.
Meanwhile, as shown in FIG. 12, the evolutionary search algorithm 1202 is also connected to the policy knowledge base 1207 and the performance predictor 1208. The evolutionary search algorithm 1202 can store the policy models obtained by each run of the evolutionary algorithm in the policy knowledge base 1207; in subsequent operation, whenever a new tensor is input, the corresponding computing strategy can be looked up directly in the policy knowledge base 1207 to determine the split count of the current tensor, without re-running the evolutionary algorithm, shortening the time needed to determine the split count of each tensor and improving computational efficiency. The performance predictor 1208 replaces real on-hardware verification with prediction and feeds back performance data by prediction, improving computational efficiency.
In this embodiment, the splitting method and split count of the first tensor are determined in the above manner, realizing the splitting operation that splits the first tensor into at least two first sub-tensors. The subsequent steps are then performed.
302. Determine the target computation order of the at least two first sub-tensors.
In this embodiment, the target computation order is the sequential processing order of the at least two first sub-tensors. For example, as shown in FIG. 4, the first tensor includes three operators, and the splitting method and split count are determined for each operator in the above manner: the first operator 401 is split into 2 parts, the second operator 402 into 4 parts, and the third operator 403 into 4 parts. There are two different methods for the computation order of the split operators (that is, the sub-tensors): I. depth-first, and II. breadth-first. They are described respectively below.
I. Depth-first.
Referring to FIG. 13a, FIG. 13a is a schematic diagram of different computation orders. As shown at 1301 in FIG. 13a, in the depth-first computation order, after the first slice 13011 (sub-tensor) of the first operator is computed, the first slice 13012 of the next operator is computed; after the first slices of all operators have been executed, the second slices of all operators are executed in turn, until all slices of all operators have been executed.
II. Breadth-first.
Referring to FIG. 13a, FIG. 13a is a schematic diagram of different computation orders. As shown at 1302 in FIG. 13a, in the breadth-first computation order, the first slice 13021 through the last slice 13022 of the first operator are computed in turn, then the first through last slices of the second operator are executed, until all slices of all operators have been executed.
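The two orders can be illustrated with a short sketch that enumerates slice indices for the FIG. 4 split counts (2, 4, 4); the (operator, slice) labels are illustrative.

```python
def depth_first(slice_counts):
    # all operators' slice 0, then all operators' slice 1, and so on
    order = []
    for s in range(max(slice_counts)):
        for op, n in enumerate(slice_counts):
            if s < n:
                order.append((op, s))
    return order

def breadth_first(slice_counts):
    # all slices of operator 0, then all slices of operator 1, and so on
    return [(op, s) for op, n in enumerate(slice_counts) for s in range(n)]

print(depth_first([2, 4, 4]))    # [(0, 0), (1, 0), (2, 0), (0, 1), ...]
print(breadth_first([2, 4, 4]))  # [(0, 0), (0, 1), (1, 0), ...]
```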
Optionally, on chip systems that support duplicating subgraphs and scheduling them for concurrent execution, the above computation order also indirectly defines the subgraph boundaries of the multi-threaded concurrent pipeline of pipelined subgraphs. Referring to FIG. 13b, FIG. 13b shows the correspondence between breadth-first order and pipeline-subgraph boundaries; as shown in FIG. 13b, the breadth-first node 1303 can be regarded as the boundary node of the pipeline subgraph 1304.
Of the two computation orders, the depth-first order reuses cached data better and obtains better memory benefits. However, the depth-first order involves switching between operators, which causes more parameter loading (such as the weight and bias of conv2d), so data in the cache are evicted, or data that cannot be kept in the buffer are dropped to external storage. In such cases, the breadth-first order is actually better. Therefore, under the premise of satisfying computation dependencies, determining the computation order is also a fairly complex optimization problem that cannot be solved by manual modeling.
This problem can likewise be solved by machine learning, that is, by the approaches above: I. the evolutionary algorithm; II. the reinforcement learning algorithm; III. reinforcement-learning tuning; and IV. the combination of the evolutionary algorithm and reinforcement learning. They are described in detail below.
I. Evolutionary algorithm.
In the evolutionary algorithm, when encoding the gene sequences, the different computation orders are also used as encoding factors. Specifically, a gene sequence is organized as vector<pair<split_num, order>>, where the vector length represents the number of nodes in the topological order, split_num represents the split count of the node, and order represents the computation order, with depth-first denoted D and breadth-first denoted B. FIG. 14 shows a memory-encoding form of a computation order. As shown in FIG. 14, the gene encoding of a strategy takes the form [4,B,4,D,2,D,2,B]. This gene encoding means that the tensor shown in FIG. 14 includes four operators, namely the first operator 1401, the second operator 1402, the third operator 1403, and the fourth operator 1404, where the first operator 1401 is split into four slices op1_1 to op1_4, the second operator 1402 is split into four slices op2_1 to op2_4, the third operator 1403 is split into two slices op3_1 to op3_2, and the fourth operator 1404 is split into two slices op4_1 to op4_2. Slices op1_1 to op1_4 use the breadth-first computation order, slices op2_1 to op2_4 use the depth-first computation order, slices op3_1 to op3_2 use the depth-first order, and slices op4_1 to op4_2 use the breadth-first order.
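A small sketch of decoding this gene encoding (written here in Python rather than the C++ vector<pair<split_num, order>> layout, purely for illustration):

```python
from typing import List, Tuple

def decode_gene(gene: List) -> List[Tuple[int, str]]:
    # pair up (split_num, order) entries; 'D' = depth-first, 'B' = breadth-first
    return [(gene[i], gene[i + 1]) for i in range(0, len(gene), 2)]

gene = [4, "B", 4, "D", 2, "D", 2, "B"]  # the FIG. 14 strategy encoding
for op_idx, (split_num, order) in enumerate(decode_gene(gene), start=1):
    kind = "depth-first" if order == "D" else "breadth-first"
    print(f"operator {op_idx}: {split_num} slices, {kind}")
```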
The above gene encoding can thus serve as an individual in one round of iteration, and the evolutionary algorithm performs the strategy optimization, finally obtaining the policy model that can determine the optimal computation order. For the subsequent steps of the evolutionary algorithm, refer to the preceding description, which is not repeated here.
For the latter three approaches, namely II. the reinforcement learning algorithm, III. reinforcement-learning tuning, and IV. the algorithm combining the evolutionary algorithm with reinforcement learning, refer to the preceding description; it suffices to also input the different computation orders into the model as training quantities when inputting the training set. Details are not repeated here.
After the policy model has been obtained in the above manner, when a new tensor is input, the tensor is input into the policy model, and the target computation order for that tensor can then be obtained from the policy model.
303. Process the at least two first sub-tensors according to the target computation order.
In this embodiment, the above method determines the splitting method and split count of the first tensor, splitting the first tensor into at least two first sub-tensors, and also determines the target computation order. Processing the at least two split first sub-tensors in the target computation order thus accomplishes the processing of the first tensor.
Optionally, FIG. 15 provides a full-flow schematic diagram of an embodiment of this application. As shown in FIG. 15, the architecture of the flow includes a reinforcement learning (RL) training module 1501, a policy optimization module 1502, a test module 1503, and a compilation module 1504 for computation graph compilation. In actual operation, the RL training module 1501 trains a policy model by the reinforcement learning method described above and sends the policy model to the compilation module 1504. The policy optimization module 1502 obtains an optimized policy model through the aforementioned evolutionary (GA) search, reinforcement-learning optimization, or a combination of the two, and sends it to the compilation module 1504, which stores the optimized policy model in the knowledge base.
It should be noted that when the policy optimization module 1502 performs reinforcement-learning optimization, the initial policy model it uses is obtained by the compilation module 1504 from the RL training module 1501 and then sent to the policy optimization module 1502.
During the operation of the RL training module 1501 and the policy optimization module 1502, the obtained policy models need to be measured by the test module 1503 before being sent to the compilation module 1504. The test module 1503 lets the obtained policy models interact with the real environment through on-hardware verification, so that the performance of the policy models can be evaluated; optionally, the test module 1503 can also evaluate the performance of the policy models through simulator simulation or performance-predictor prediction, improving the working efficiency of the test module 1503.
As shown in FIG. 15, the compilation module 1504 performs computation graph compilation. When a new first tensor is input, the compilation module 1504 first queries the knowledge base. If an optimal policy model for processing the current first tensor is found in the knowledge base, that optimal policy model is used to perform subgraph/operator compilation on the first tensor, so that the first tensor is processed optimally. If the compilation module 1504 does not find an optimal policy model for processing the current first tensor in the knowledge base, the first tensor is processed with the policy model generated by the RL training module 1501, ensuring that the first tensor can be processed.
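A hypothetical sketch of the compilation module's knowledge-base lookup with RL fallback follows; the class, its methods, and the strategy values are assumptions made for illustration.

```python
class CompilationModule:
    def __init__(self, knowledge_base, rl_policy_model):
        self.kb = knowledge_base          # dict: tensor signature -> strategy
        self.rl_model = rl_policy_model   # callable fallback

    def compile(self, tensor_sig):
        strategy = self.kb.get(tensor_sig)
        if strategy is None:              # no optimal strategy in the KB yet
            strategy = self.rl_model(tensor_sig)
        return strategy                   # used for subgraph/operator compilation

kb = {"matmul_8x4x6": (4, "D")}
module = CompilationModule(kb, rl_policy_model=lambda sig: (2, "B"))
print(module.compile("matmul_8x4x6"))  # found in KB -> (4, 'D')
print(module.compile("conv_new"))      # fallback to RL model -> (2, 'B')
```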
The data processing method provided by the embodiments of this application includes: splitting a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed; determining a target computation order of the at least two first sub-tensors, where the target computation order is the sequential processing order of the at least two first sub-tensors; and processing the at least two first sub-tensors according to the target computation order. By determining the splitting method and split count and controlling the computation order of operators, the data throughput of neural network computing tasks can be effectively reduced, so the cache is used more efficiently to improve the pipelined execution efficiency of data access and computation.
Further, an embodiment of this application provides a computer device. As shown in FIG. 16, the device includes at least one processor 1601, a communication line 1602, a memory 1603, and at least one communication interface 1604.
The processor 1601 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solutions of this application.
The communication line 1602 may include a path for transferring information between the above components.
The communication interface 1604 is any transceiver-like apparatus used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The memory 1603 may be a read-only memory (ROM) or another type of static storage device capable of storing non-volatile information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 1602, or may be integrated with the processor.
The memory 1603 is used to store the computer-executable instructions for executing the solutions of this application, and execution is controlled by the processor 1601. The processor 1601 executes the computer-executable instructions stored in the memory 1603, thereby implementing the data processing method provided by this application.
Optionally, the computer-executable instructions in this application may also be called application program code, which is not specifically limited in this application.
In specific implementation, as an embodiment, the processor 1601 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 16.
In specific implementation, as an embodiment, the electronic device may include multiple processors, such as the processor 1601 and the processor 1607 in FIG. 16. Each of these processors may be a single-core or multi-core processor. A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).
In specific implementation, as an embodiment, the electronic device may further include an output device 1605 and an input device 1606. The output device 1605 communicates with the processor 1601 and can display information in various ways; for example, the output device 1605 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 1606 communicates with the processor 1601 and can receive user input in various ways; for example, the input device 1606 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
The above electronic device may be a general-purpose device or a special-purpose device. In specific implementation, the electronic device may be the device used for running the data processing method in the embodiments of this application. This application does not limit the type of the electronic device.
The embodiments of this application may divide the electronic device into functional units according to the above method examples; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of this application is schematic and is only a logical function division; there may be other division methods in actual implementation.
For example, in the case where the functional units are divided in an integrated manner, FIG. 17 shows a schematic structural diagram of a data processing apparatus provided by an embodiment of this application.
As shown in FIG. 17, the data processing apparatus provided by the embodiment of this application includes:
a splitting unit 1701, configured to split a first tensor into at least two first sub-tensors, where the first tensor is a multi-dimensional tensor to be processed;
an execution unit 1702, configured to determine a target computation order of the at least two first sub-tensors split by the splitting unit 1701, where the target computation order is the sequential processing order of the at least two first sub-tensors;
a processing unit 1703, configured to process the at least two first sub-tensors according to the target computation order determined by the execution unit 1702.
Optionally, the splitting unit 1701 is further configured to:
input the first tensor into a left-hand matrix;
split the first tensor into N parts along one axis of the left-hand matrix, where N is a positive integer greater than or equal to 2;
input the first tensor into a right-hand matrix;
split the first tensor into M parts along one axis of the right-hand matrix, where M is a positive integer greater than or equal to 2;
perform tensor concatenation-aggregation on the N split parts and the M split parts of the first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
Optionally, the splitting unit 1701 is further configured to:
split the first tensor along one axis of the first tensor;
perform tensor reduce-aggregation on the split first tensor to obtain a split first tensor, where the split first tensor includes the at least two first sub-tensors.
Optionally, the apparatus further includes a determining unit 1704, configured to:
obtain all splitting methods of the first tensor;
obtain, according to a mapping relationship, all split-aggregation flows of the first tensor along the depth direction of the computation graph, where the mapping relationship is the relationship between the slices of the split first tensor, and each different splitting method corresponds to one split-aggregation flow;
determine, as the target splitting method, the splitting method corresponding to the split-aggregation flow whose mapping relationship propagates farthest.
The splitting unit 1701 is further configured to:
split the first tensor into the at least two first sub-tensors by the target splitting method.
Optionally, the apparatus further includes a training unit 1705, configured to:
obtain a training tensor, where the training tensor and the first tensor are different tensors;
determine multiple different computing strategies according to the training tensor, where a computing strategy includes the split count and computation order of the training tensor;
train and generate a target policy model according to the different computing strategies, where the target policy model includes the feedback time of the same training tensor under different computing strategies.
Optionally, the training unit 1705 is further configured to:
in each of multiple rounds of iteration, use a target training tensor as a fixed input to the target policy model, where the target training tensor and the first tensor are different tensors;
tune the target policy model according to the results output by the target policy model in the multiple rounds of iteration.
Optionally, the training unit 1705 is further configured to:
encode the different computing strategies as gene sequences;
treat each computing strategy as an individual, and iteratively verify the gene sequence of each individual;
obtain the converged optimal solution of the iteration results as the computing strategy in the target policy model.
Optionally, the training unit 1705 is further configured to:
encode the computing strategies in the tuned target policy model as the gene sequences.
Optionally, the training unit 1705 is further configured to:
input the target policy model into a simulator;
obtain the feedback result output by the simulator after data simulation of the target policy model, where the feedback result indicates the performance of the target policy model; or,
input the target policy model into a performance predictor;
obtain the prediction result output by the performance predictor, where the prediction result predicts the performance of the target policy model.
Optionally, the training unit 1705 is further configured to:
obtain the target policy model from the policy knowledge base;
obtain a target computing strategy according to the target policy model, where the target computing strategy includes the split count of the first tensor and the target computation order.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof; when implemented by software, they may be implemented wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, a solid state disk (SSD)), and the like.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms so used are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attribute adopted when describing the embodiments of this application. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units but may include other units not clearly listed or inherent to the process, method, product, or device. In the embodiments of this application, "multiple" means two or more.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as preferred or more advantageous than other embodiments or designs. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a specific manner.
In the embodiments of this application, various examples are given for ease of understanding; however, these examples are merely examples and are not meant to be the best way of implementing this application.
The technical solutions provided by this application have been described in detail above. Specific examples are used in this application to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, a person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.

Claims (13)

  1. A data processing method, comprising:
    splitting a first tensor into at least two first sub-tensors, wherein the first tensor is a multi-dimensional tensor to be processed;
    determining a target computation order of the at least two first sub-tensors, wherein the target computation order is the sequential processing order of the at least two first sub-tensors; and
    processing the at least two first sub-tensors according to the target computation order.
  2. The method according to claim 1, wherein splitting the first tensor into at least two first sub-tensors comprises:
    inputting the first tensor into a left-hand matrix;
    splitting the first tensor into N parts along one axis of the left-hand matrix, wherein N is a positive integer greater than or equal to 2;
    inputting the first tensor into a right-hand matrix;
    splitting the first tensor into M parts along one axis of the right-hand matrix, wherein M is a positive integer greater than or equal to 2; and
    performing tensor concatenation-aggregation on the N split parts and the M split parts of the first tensor to obtain a split first tensor, wherein the split first tensor comprises the at least two first sub-tensors.
  3. The method according to claim 1 or 2, wherein splitting the first tensor into at least two first sub-tensors comprises:
    splitting the first tensor along one axis of the first tensor; and
    performing tensor reduce-aggregation on the split first tensor to obtain a split first tensor, wherein the split first tensor comprises the at least two first sub-tensors.
  4. The method according to any one of claims 1 to 3, wherein before splitting the first tensor into at least two first sub-tensors, the method further comprises:
    obtaining all splitting methods of the first tensor;
    obtaining, according to a mapping relationship, all split-aggregation flows of the first tensor along the depth direction of the computation graph, wherein the mapping relationship is the relationship between the slices of the split first tensor, and each different splitting method corresponds to one split-aggregation flow; and
    determining, as a target splitting method, the splitting method corresponding to the split-aggregation flow whose mapping relationship propagates farthest;
    wherein splitting the first tensor into at least two first sub-tensors comprises:
    splitting the first tensor into the at least two first sub-tensors by the target splitting method.
  5. The method according to any one of claims 1 to 4, wherein before splitting the first tensor into at least two first sub-tensors, the method further comprises:
    obtaining a training tensor, wherein the training tensor and the first tensor are different tensors;
    determining multiple different computing strategies according to the training tensor, wherein a computing strategy comprises the split count and computation order of the training tensor; and
    training and generating a target policy model according to the different computing strategies, wherein the target policy model comprises the feedback time of the same training tensor under different computing strategies.
  6. The method according to claim 5, wherein after training and generating the target policy model according to the different computing strategies, the method further comprises:
    in each of multiple rounds of iteration, using a target training tensor as a fixed input to the target policy model, wherein the target training tensor and the first tensor are different tensors; and
    tuning the target policy model according to the results output by the target policy model in the multiple rounds of iteration.
  7. The method according to claim 6, wherein training and generating the target policy model according to the different computing strategies comprises:
    encoding the different computing strategies as gene sequences;
    treating each computing strategy as an individual, and iteratively verifying the gene sequence of each individual; and
    obtaining the converged optimal solution of the iteration results as the computing strategy in the target policy model.
  8. The method according to claim 7, wherein encoding the different computing strategies as gene sequences comprises:
    encoding the computing strategies in the tuned target policy model as the gene sequences.
  9. The method according to any one of claims 5 to 8, wherein after training and generating the target policy model according to the different computing strategies, the method further comprises:
    inputting the target policy model into a simulator, and obtaining the feedback result output by the simulator after data simulation of the target policy model, wherein the feedback result indicates the performance of the target policy model; or
    inputting the target policy model into a performance predictor, and obtaining the prediction result output by the performance predictor, wherein the prediction result predicts the performance of the target policy model.
  10. The method according to any one of claims 5 to 9, wherein the method further comprises: adding the target policy model to a policy knowledge base; and determining the target computation order of the at least two first sub-tensors comprises:
    obtaining the target policy model from the policy knowledge base; and
    obtaining a target computing strategy according to the target policy model, wherein the target computing strategy comprises the split count of the first tensor and the target computation order.
  11. A data processing apparatus, comprising:
    a splitting unit, configured to split a first tensor into at least two first sub-tensors, wherein the first tensor is a multi-dimensional tensor to be processed;
    an execution unit, configured to determine a target computation order of the at least two first sub-tensors split by the splitting unit, wherein the target computation order is the sequential processing order of the at least two first sub-tensors; and
    a processing unit, configured to process the at least two first sub-tensors according to the target computation order determined by the execution unit.
  12. A computer device, comprising a processor and a memory, wherein the processor, when running computer instructions stored in the memory, performs the method according to any one of claims 1 to 10.
  13. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 10.
PCT/CN2021/077413 2021-02-23 2021-02-23 一种数据处理方法、装置、设备及介质 WO2022178660A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP21927108.7A EP4280107A4 (en) 2021-02-23 2021-02-23 DATA PROCESSING METHOD AND DEVICE, APPARATUS AND MEDIUM
PCT/CN2021/077413 WO2022178660A1 (zh) 2021-02-23 2021-02-23 一种数据处理方法、装置、设备及介质
CN202180092652.0A CN116868202A (zh) 2021-02-23 2021-02-23 一种数据处理方法、装置、设备及介质
US18/453,681 US20230394110A1 (en) 2021-02-23 2023-08-22 Data processing method, apparatus, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/077413 WO2022178660A1 (zh) 2021-02-23 2021-02-23 一种数据处理方法、装置、设备及介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/453,681 Continuation US20230394110A1 (en) 2021-02-23 2023-08-22 Data processing method, apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2022178660A1 true WO2022178660A1 (zh) 2022-09-01

Family

ID=83048559

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077413 WO2022178660A1 (zh) 2021-02-23 2021-02-23 一种数据处理方法、装置、设备及介质

Country Status (4)

Country Link
US (1) US20230394110A1 (zh)
EP (1) EP4280107A4 (zh)
CN (1) CN116868202A (zh)
WO (1) WO2022178660A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269204A (zh) * 2022-09-27 2022-11-01 之江实验室 一种用于神经网络编译的内存优化方法及装置
CN116150563A (zh) * 2023-02-24 2023-05-23 之江实验室 一种业务执行方法、装置、存储介质及电子设备

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012628A (zh) * 2024-03-15 2024-05-10 北京壁仞科技开发有限公司 一种数据处理方法、装置和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170168991A1 (en) * 2015-12-10 2017-06-15 Significs And Elements, Llc Systems and methods for selective expansive recursive tensor analysis
CN110263923A (zh) * 2019-08-12 2019-09-20 上海燧原智能科技有限公司 张量卷积计算方法及系统
CN110647973A (zh) * 2018-06-27 2020-01-03 北京中科寒武纪科技有限公司 运算方法及相关方法和产品
US20200410337A1 (en) * 2019-06-28 2020-12-31 Amazon Technologies, Inc Dynamic processing element array expansion

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170168991A1 (en) * 2015-12-10 2017-06-15 Significs And Elements, Llc Systems and methods for selective expansive recursive tensor analysis
CN110647973A (zh) * 2018-06-27 2020-01-03 北京中科寒武纪科技有限公司 运算方法及相关方法和产品
US20200410337A1 (en) * 2019-06-28 2020-12-31 Amazon Technologies, Inc Dynamic processing element array expansion
CN110263923A (zh) * 2019-08-12 2019-09-20 上海燧原智能科技有限公司 张量卷积计算方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4280107A4 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115269204A (zh) * 2022-09-27 2022-11-01 之江实验室 一种用于神经网络编译的内存优化方法及装置
CN115269204B (zh) * 2022-09-27 2022-12-30 之江实验室 一种用于神经网络编译的内存优化方法及装置
CN116150563A (zh) * 2023-02-24 2023-05-23 之江实验室 一种业务执行方法、装置、存储介质及电子设备
CN116150563B (zh) * 2023-02-24 2024-01-05 之江实验室 一种业务执行方法、装置、存储介质及电子设备

Also Published As

Publication number Publication date
US20230394110A1 (en) 2023-12-07
EP4280107A1 (en) 2023-11-22
CN116868202A (zh) 2023-10-10
EP4280107A4 (en) 2024-03-20

Similar Documents

Publication Publication Date Title
WO2022178660A1 (zh) 一种数据处理方法、装置、设备及介质
Huang et al. Swapadvisor: Pushing deep learning beyond the gpu memory limit via smart swapping
WO2019067931A1 (en) SELF-ADJUSTMENT BASED ON A GRADIENT FOR LEARNING MACHINE AND DEPTH LEARNING MODELS
CN113361680B (zh) 一种神经网络架构搜索方法、装置、设备及介质
Zhang et al. Pasca: A graph neural architecture search system under the scalable paradigm
US20200125961A1 (en) Mini-machine learning
CN115543639B (zh) 分布式执行深度学习任务的优化方法和分布式系统
Li et al. Automating cloud deployment for deep learning inference of real-time online services
CN113821332B (zh) 自动机器学习系统效能调优方法、装置、设备及介质
CN111966495B (zh) 数据处理方法和装置
Zhang et al. Autosync: Learning to synchronize for data-parallel distributed deep learning
WO2017181837A1 (zh) 一种渲染程序的在线优化方法
Hafeez et al. Empirical analysis and modeling of compute times of cnn operations on aws cloud
Zhao et al. LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
CN114912041A (zh) 信息处理方法、电子设备和计算机程序产品
KR20220032861A (ko) 하드웨어에서의 성능을 고려한 뉴럴 아키텍처 서치 방법 빛 장치
CN109711555B (zh) 一种预测深度学习模型单轮迭代时间的方法和系统
Yi et al. Optimizing DNN compilation for distributed training with joint OP and tensor fusion
He et al. HOME: A holistic GPU memory management framework for deep learning
Hu et al. Hydro:{Surrogate-Based} Hyperparameter Tuning Service in Datacenters
Wu et al. A genetic-ant-colony hybrid algorithm for task scheduling in cloud system
CN111985631B (zh) 信息处理设备、信息处理方法及计算机可读记录介质
WO2021051920A1 (zh) 模型优化方法、装置、存储介质及设备
CN108846248B (zh) 一种应用建模及性能预测方法
CN108256694A (zh) 基于重复遗传算法的模糊时间序列预测系统、方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927108

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180092652.0

Country of ref document: CN

ENP Entry into the national phase

Ref document number: 2021927108

Country of ref document: EP

Effective date: 20230817

NENP Non-entry into the national phase

Ref country code: DE