WO2022141924A1 - Neural network operation method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2022141924A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
block
dimension
candidate
Prior art date
Application number
PCT/CN2021/086229
Other languages
French (fr)
Chinese (zh)
Inventor
徐磊 (XU Lei)
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority application KR1020227010736A, published as KR20220098341A
Publication of WO2022141924A1

Classifications

    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F 11/3037: Monitoring arrangements where the monitored computing system component is a memory, e.g. virtual memory, cache
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the multiple block strategies include at least one of the following: using all of the input data as initial data and, based on the determined dimension parameter of the constant data, dividing the constant data into blocks of a specified dimension to obtain a block result, where the initial data is placed in the initial data area allocated for the direct memory access (DMA) task when the computing device runs the target neural network; using all of the constant data as initial data and, based on the determined dimension parameter of the input data, dividing the input data into blocks of the specified dimension to obtain a block result; or using part of the input data as initial data and, based on the determined dimension parameter of the constant data, dividing the constant data into blocks of the specified dimension to obtain a block result, where the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
  • FIG. 4 shows a schematic diagram of software and hardware scheduling of a computing device in a method for running a neural network provided by an embodiment of the present disclosure
  • the official inference library provided by the manufacturer of the computing device can be used to run a large-scale neural network on the computing device, but the official inference library targets specific basic neural networks.
  • for an optimized neural network, the official inference library may be unavailable, or the computing device may run the optimized neural network less efficiently when using the official inference library.
  • the official inference library is an available inference deployment solution.
  • for example, the official inference library can be the cdnn library of the vertex dsp.
  • multiple operators and/or multiple block strategies corresponding to each network layer to be processed may be determined based on historical experience data. For example, based on historical experience data, it can be determined that the operators corresponding to network layer 1 to be processed include operator 1, operator 2, and operator 3, with corresponding block strategies 1, 2, and 4; and that the operators corresponding to network layer 2 to be processed include operator 1, operator 3, operator 4, and operator 5, with corresponding block strategies 2 and 5.
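The per-layer candidate bookkeeping described above might be represented as in the following sketch; the layer names, operator numbers, and strategy numbers are hypothetical and simply mirror the example in the text.

```python
# Hypothetical record of per-layer candidate operators and block strategies,
# as drawn from historical experience data in the text's example.
candidates = {
    "layer_1": {"operators": [1, 2, 3], "block_strategies": [1, 2, 4]},
    "layer_2": {"operators": [1, 3, 4, 5], "block_strategies": [2, 5]},
}

def candidate_pairs(layer):
    """Every (operator, block strategy) combination to be evaluated for a layer."""
    c = candidates[layer]
    return [(op, s) for op in c["operators"] for s in c["block_strategies"]]

print(len(candidate_pairs("layer_1")))  # 3 operators x 3 strategies = 9
```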
  • one or more first candidate operators corresponding to a network layer to be processed may be determined from the plurality of operators. For example, according to the task of each network layer to be processed, an operator capable of completing that task can be selected from the plurality of operators as a first candidate operator corresponding to the network layer; alternatively, the first candidate operator corresponding to the network layer to be processed can be determined according to the requirements of the neural network.
  • At least one target candidate operator and target candidate block strategy corresponding to the network layer to be processed can be determined by using the calculation cost value in the following two ways:
  • the block strategy 1 is determined as the target candidate block strategy corresponding to the network layer 1 to be processed.
  • Step 1: select a target resource consumption situation that satisfies a preset condition from the plurality of resource consumption situations corresponding to the first candidate operators, where one first candidate operator corresponds to one resource consumption situation under one block strategy.
  • Step 2: determine the block strategy corresponding to the target resource consumption situation as a candidate block strategy and, based on the candidate block strategy, run the network layer to be processed that includes the second candidate operator corresponding to the target resource consumption situation, to determine the test result for the candidate block strategy.
  • the second candidate operator and the candidate block strategy corresponding to each target overhead value may be actually measured to obtain the test result corresponding to that target overhead value. That is, for each target overhead value, the block strategy corresponding to the target resource consumption can be determined as a candidate block strategy and, based on that candidate block strategy, the network layer to be processed that includes the second candidate operator corresponding to the target overhead value can be run to determine the test result corresponding to the target overhead value, i.e., the test result corresponding to the candidate block strategy and the second candidate operator.
  • one or more target candidate operators corresponding to the network layer to be processed, and the target candidate block strategy corresponding to each target candidate operator, may be determined based on the test results. For example, when the test result is the running time, the second candidate operator with the shortest running time can be selected as the target candidate operator of the network layer to be processed, and the candidate block strategy corresponding to that second candidate operator can be determined as the target candidate block strategy.
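Selecting the target candidate operator and block strategy from measured running times reduces to taking a minimum over the test results. The data below is hypothetical, just to show the shape of the selection.

```python
# Hypothetical measured results: (second candidate operator, candidate block
# strategy, running time in milliseconds). The shortest running time wins.
test_results = [
    ("op_1", "strategy_1", 4.2),
    ("op_2", "strategy_3", 2.9),
    ("op_3", "strategy_2", 3.7),
]

# The operator with the shortest running time becomes the target candidate
# operator; its matching strategy becomes the target candidate block strategy.
best_op, best_strategy, best_time = min(test_results, key=lambda r: r[2])
print(best_op, best_strategy)  # op_2 strategy_3
```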
  • the first candidate operator and the second candidate operator may be operators capable of realizing the function of the network layer to be processed.
  • the resource consumption is represented by a calculation cost value
  • the calculation cost value of the first candidate operator under each block strategy can be determined according to the following steps:
  • Step 1: determine the restricted scenario corresponding to the first candidate operator at the preset size, where the restricted scenario is determined based on the calculation time and the transmission time associated with the data capacity corresponding to the first candidate operator at the preset size;
  • Step 3: when the restricted scenario is a computation-limited scenario, use the result of blocking based on the block strategy to determine the calculation time of the parameter data corresponding to the first candidate operator under the block strategy, and then determine the calculation cost value of the first candidate operator under the block strategy according to the number of DMA tasks and the DMA rate corresponding to the computing device.
  • the data capacity that can be transmitted by DMA in the target time is related to the transmission speed
  • the target data capacity is related to the calculation speed.
  • when the ratio is greater than 1, the transmission speed is greater than the calculation speed (that is, the transmission time is less than the calculation time), which is a computation-limited scenario; when the ratio is less than or equal to 1, the transmission speed is less than or equal to the calculation speed (that is, the transmission time is greater than or equal to the calculation time), which is a bandwidth-limited scenario. Different ways of determining the calculation cost value can then be chosen for the different restricted scenarios.
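The ratio test above can be sketched as a small classifier; all names and numbers here are assumptions, only the ratio rule comes from the text.

```python
def restricted_scenario(dma_rate_bytes_per_s, compute_time_s, target_capacity_bytes):
    """Classify per the text: compare the data capacity DMA can transfer
    within the computation time against the operator's target data capacity.
    ratio > 1  -> transfer is faster than compute -> computation-limited
    ratio <= 1 -> bandwidth-limited
    """
    transferable = dma_rate_bytes_per_s * compute_time_s
    ratio = transferable / target_capacity_bytes
    return "computation-limited" if ratio > 1 else "bandwidth-limited"

# e.g. 1 GB/s DMA, 2 ms of compute, 1 MB of operator data:
# 2 MB transferable vs 1 MB needed -> computation-limited
print(restricted_scenario(1e9, 0.002, 1e6))
```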
  • the target data capacity of the parameter data corresponding to the first candidate operator may be determined based on the preset size information of the parameter data of the first candidate operator.
  • the target data capacity may be the sum of constant data (including weight data and bias data), output data and input data.
  • the restricted scenario can then be determined based on the ratio of the data capacity that can be transmitted by the DMA within the calculated target time to the target data capacity.
  • the DMA task overhead corresponding to the computing device can be determined, in seconds (s).
  • the cycles required to create a DMA task can be converted into time, i.e., the DMA task overhead; and the DMA rate, i.e., the DMA transfer rate, can be determined in bytes per second (Byte/s).
  • the calculation cost value of the first candidate operator under the block strategy may be determined by using the first cost function.
  • the total amount of DMA data transmission can be determined from the generated DMA tasks, and the number of DMA tasks can be determined from the number of data blocks obtained after the parameter data is divided based on the block strategy. For example, when one data block corresponds to one DMA task and 10 data blocks are generated, it is determined that there are 10 DMA tasks.
  • the total amount of DMA data transmission and the number of DMA tasks may be determined according to actual conditions, and this is only an exemplary description.
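A possible shape for the first cost function, consistent with the quantities named above (total DMA transmission, DMA rate, DMA task count, per-task creation overhead). The disclosure does not give the exact formula, so this is an assumed sketch with hypothetical names.

```python
def dma_task_overhead_s(cycles_to_create_task, clock_hz):
    # Convert the cycles needed to create one DMA task into seconds,
    # as the text describes for the DMA task overhead.
    return cycles_to_create_task / clock_hz

def calculation_cost(total_dma_bytes, dma_rate_bytes_per_s,
                     num_dma_tasks, task_overhead_s):
    # Assumed cost model: pure transfer time plus per-task creation overhead.
    return total_dma_bytes / dma_rate_bytes_per_s + num_dma_tasks * task_overhead_s

# 10 data blocks -> 10 DMA tasks, mirroring the text's example; a 1 GHz
# device taking 2000 cycles per task gives 2 microseconds of overhead each.
overhead = dma_task_overhead_s(2000, 1e9)
cost = calculation_cost(10 * 2**20, 2**30, 10, overhead)  # 10 MiB at 1 GiB/s
```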
  • when the first candidate operator is a convolution operator, the number of DMA tasks obtained from the block result can be determined according to convolution parameters such as the convolution kernel size and convolution stride corresponding to the convolution operator.
  • the operator overhead conversion bandwidth is the amount of operator transmission data determined based on the calculation time of the first candidate operator at the preset size and the size of the parameter data corresponding to the first candidate operator under the block strategy. For example, when the preset size is 1024×1024×128, the calculation time of the first candidate operator at the preset size is 10 milliseconds, and the size of the blocked parameter data is 512×512×64, the calculation time of the parameter data corresponding to the first candidate operator under the block strategy is 1.25 milliseconds. Then, based on the determined calculation speed and this calculation time (for example, 1.25 milliseconds), the operator overhead conversion bandwidth corresponding to the first candidate operator after blocking is determined.
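The worked example above can be checked numerically, under the assumption the text implies: compute time scales linearly with the element count of the parameter data.

```python
# Reproduce the text's example: 10 ms at the preset size 1024x1024x128,
# scaled linearly down to the blocked size 512x512x64.
preset = (1024, 1024, 128)
blocked = (512, 512, 64)
preset_time_ms = 10.0

def elements(shape):
    n = 1
    for d in shape:
        n *= d
    return n

# Each dimension is halved, so the element count drops by a factor of 8.
blocked_time_ms = preset_time_ms * elements(blocked) / elements(preset)
print(blocked_time_ms)  # 1.25
```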
  • in step 2 and step 3, based on the block result, determine the target data capacity corresponding to the aligned parameter data of the first candidate operator, the number of operator calls, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead.
  • an alignment operation is performed on the parameter data corresponding to the first candidate operator to obtain the aligned parameter data corresponding to the first candidate operator, where the minimum granularity information includes the minimum granularity corresponding to the parameter data in different dimensions, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information.
  • the specific process of the alignment operation can be selected according to actual needs, for example, conventional data alignment methods such as padding.
  • an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data corresponding to the first candidate operator.
  • the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information, which reduces the probability of parameter data being lost when running the target neural network based on the target block strategy.
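The alignment rule can be illustrated with a small padding-style helper (hypothetical names): ceiling division rounds each dimension up to a multiple of that dimension's minimum granularity.

```python
def align_shape(shape, min_granularity):
    """Round each dimension up to an integer multiple of that dimension's
    minimum granularity (a padding-style alignment, as the text suggests).
    -(-d // g) is ceiling division for positive integers."""
    return tuple(-(-d // g) * g for d, g in zip(shape, min_granularity))

# e.g. parameter data of 100x60x3 with per-dimension granularities 32, 16, 4
print(align_shape((100, 60, 3), (32, 16, 4)))  # (128, 64, 4)
```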
  • each test network includes, for each network layer to be processed, one target candidate operator corresponding to that layer and a target candidate block strategy matching that target candidate operator.
  • S2022: run the multiple test networks respectively to obtain multiple test results, where each test network corresponds to one test result.
  • the target neural network includes a first network layer to be processed, a second network layer to be processed and a third network layer to be processed
  • the first network layer to be processed includes target candidate operator 1 with its corresponding block strategy 1, and target candidate operator 2 with its corresponding block strategy 2;
  • the second network layer to be processed includes target candidate operator 3 and the block strategy corresponding to target candidate operator 3;
  • the third network layer to be processed includes target candidate operator 5 and block strategy 3 corresponding to target candidate operator 5.
  • the target candidate operator and target candidate block strategy of each network layer to be processed included in the target test network may be determined, respectively, as the target operator and target block strategy corresponding to each network layer to be processed in the target neural network.
  • each network layer to be processed can include one target operator matching a target block strategy, for example, target operator 1 matching target block strategy 1; alternatively, a network layer to be processed may include two target operators that match target block strategies.
  • for example, the two target operators can both be target operators that match target block strategy 1.
  • a threshold for the number of test networks corresponding to the target neural network may be set.
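Constructing test networks from per-layer candidate pairs, with a count threshold, might look like this sketch. The pair values follow the text's example layers; the strategy for operator 3 and `MAX_TEST_NETWORKS` are assumptions.

```python
import itertools

# One (target candidate operator, matching target candidate block strategy)
# pair is picked per layer; a test network is one pick per layer, and the
# number of test networks can be capped by a threshold.
per_layer_pairs = [
    [("op1", "strategy1"), ("op2", "strategy2")],  # first layer to be processed
    [("op3", "strategyA")],                        # second layer (strategy assumed)
    [("op5", "strategy3")],                        # third layer
]
MAX_TEST_NETWORKS = 16  # assumed threshold on the number of test networks

test_networks = list(itertools.islice(itertools.product(*per_layer_pairs),
                                      MAX_TEST_NETWORKS))
print(len(test_networks))  # 2 test networks from the example layers
```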
  • regarding the dimension parameter: when the specified dimension is one-dimensional, the dimension parameter is the first dimension; when the specified dimension is N-dimensional, the dimension parameter includes the first dimension through the Nth dimension, where N is greater than or equal to 2 and does not exceed the dimensionality of the constant data or of the input data.
  • the multiple chunking strategies include at least one of the following:
  • Scheme 2: using all input data as initial data and, based on the determined first dimension and second dimension of the constant data, performing two-dimensional blocking on the constant data to obtain a block result.
  • all of the input data can be used as initial data, and space for the initial data can be allocated in the initial data area. Then, based on the determined first dimension of the constant data, one-dimensional blocking is performed on the constant data to obtain a block result; alternatively, based on the determined first dimension and second dimension of the constant data, two-dimensional blocking is performed on the constant data to obtain a block result.
  • part of the input data can also be used as the initial data and, based on the determined first dimension of the input data, one-dimensional blocking is performed on the input data to obtain a block result; or, based on the determined first dimension and second dimension of the input data, two-dimensional blocking is performed on the input data to obtain a block result.
  • a part of the input data is used as initial data and, based on the determined dimension parameter of the constant data, the constant data is divided into blocks of a specified dimension to obtain a block result, including:
  • i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data, together with the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameter of the constant data, meets the memory requirements of the computing device.
  • the block result indicating that allocation of the constant data fails may mean that, after the constant data is divided according to the minimum granularity of the first dimension, the resulting constant data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, allocation fails when twice the data capacity of a constant data block divided according to the minimum granularity of the first dimension is larger than the memory of the scheduling area of the computing device.
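The ping-pong allocation test described above can be sketched as follows (names assumed): under ping-pong scheduling two buffers of the block are resident at once, so allocation fails when twice the block capacity exceeds the scheduling area.

```python
def ping_pong_allocation_ok(block_bytes, scheduling_area_bytes):
    """Under ping-pong scheduling two copies of the block buffer are live
    at once, so allocation fails when 2x the block capacity is larger than
    the scheduling area, as described in the text."""
    return 2 * block_bytes <= scheduling_area_bytes

print(ping_pong_allocation_ok(48 * 1024, 128 * 1024))  # True: 96 KiB fits
print(ping_pong_allocation_ok(80 * 1024, 128 * 1024))  # False: 160 KiB does not
```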
  • Method 1: determine 1× the minimum granularity of the first dimension of the input data as the target size of the partial input data, use the partial input data as the initial data, and, based on the determined first dimension of the constant data, perform one-dimensional blocking on the constant data to obtain a one-dimensional block result;
  • partial constant data of the target size is respectively used as the initial data and, based on the determined dimension parameter of the input data, the input data is divided into blocks of the specified dimension to obtain a block result.
  • Method 1: determine 1× the minimum granularity of the first dimension of the constant data as the target size of the partial constant data, use the partial constant data as the initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a one-dimensional block result;
  • Method 2: determine 2× the minimum granularity of the first dimension of the constant data as the target size of the partial constant data, use the partial constant data as the initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a one-dimensional block result;
  • the first dimension and the second dimension for blocking the input data can be set according to information such as operation requirements and/or operator types, and the first dimension and the second dimension for blocking the constant data can likewise be set according to information such as operation requirements and/or operator types.
  • when the operator is a convolution operator, the first dimension of the constant data may be the output channel (hereinafter referred to as OC) dimension, and the second dimension may be the input channel (hereinafter referred to as IC) dimension.
  • when the specified dimension is one-dimensional and the dimension parameter includes the first dimension, the constant data and the input data are respectively used as target data and, based on the determined first dimension of the target data, one-dimensional blocking is performed on the target data to obtain one-dimensional block results, including:
  • A2: when it is determined that the multiple target data blocks and the initial data meet the set block conditions, take (k+1)× the minimum granularity corresponding to the first dimension of the target data as the updated target block size, and return to dividing the target data into one-dimensional blocks along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set block conditions, at which point k× the minimum granularity corresponding to the first dimension of the target data is determined as the block result.
  • the target block size is continuously increased, and the block result that yields a higher memory usage rate on the computing device is determined through successive attempts, which helps reduce the waste of the computing device's memory resources.
  • in step A1, k is a positive integer.
  • the minimum granularity corresponding to the first dimension of the target data is determined as the target block size, and, according to the target block size, the target data is divided into one-dimensional blocks along the first dimension to obtain the multiple target data blocks corresponding to the target data.
  • the size of the first dimension of each resulting target data block is consistent with the target block size, and the size of each target data block in every dimension other than the first is consistent with the size of the corresponding dimension of the target data.
  • for example, when the target block size is 32, the target data is divided into one-dimensional blocks along the first dimension to obtain multiple target data blocks, and the size of each target data block can be 32×64×128.
  • the number of target data blocks may be determined according to actual conditions.
  • the first dimension can be set as required.
  • the first dimension of the input data can be the width W dimension
  • the second dimension can be the input channel IC dimension
  • the first dimension of the constant data can be the output channel OC dimension
  • the second dimension of the constant data can be the input channel IC dimension.
  • if, when the target data is divided into one-dimensional blocks along the first dimension at the minimum granularity corresponding to the first dimension of the target data, the multiple target data blocks and the initial data still do not meet the set block conditions, the block result is determined to be a one-dimensional blocking failure.
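Steps A1 and A2 describe a search that grows the one-dimensional block size in multiples of the minimum granularity until the block conditions fail. A minimal sketch, with the block-condition check left to the caller as a hypothetical `fits` predicate:

```python
def one_dim_block(first_dim_size, min_granularity, fits):
    """Steps A1/A2 sketched: grow the block size k * min_granularity while
    the resulting blocks (plus initial data) still satisfy the block
    conditions, as judged by the caller-supplied `fits` predicate.
    Returns the last k that fit, or None for a one-dimensional blocking
    failure (even the minimum granularity did not fit)."""
    k = 1
    best = None
    while k * min_granularity <= first_dim_size and fits(k * min_granularity):
        best = k
        k += 1
    return best

# Toy condition: blocks of at most 96 elements along the first dimension fit.
result = one_dim_block(256, 32, lambda size: size <= 96)
print(result)  # 3, i.e. a block size of 3 x 32 = 96
```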
  • when the specified dimension is two-dimensional and the dimension parameter includes the second dimension, the constant data and the input data are respectively used as the target data and, based on the determined first dimension and second dimension of the target data, two-dimensional blocking is performed on the target data to obtain block results, including:
  • B2: determine x× the minimum granularity corresponding to the second dimension of the target data as the second target block size; based on the second target block size, divide each intermediate data block into two-dimensional blocks along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer;
  • y is a positive integer with an initial value of 1. For example, when the maximum value of y is set to 3, y can be set to 1 and steps B1 to B3 executed to obtain a two-dimensional block result; then y is set to 2 and steps B1 to B3 executed to obtain a two-dimensional block result; and y is set to 3 and steps B1 to B3 executed to obtain a two-dimensional block result. That is, three two-dimensional block results can be obtained.
  • each intermediate data block is divided into two-dimensional blocks along the second dimension to obtain the multiple target data blocks corresponding to each intermediate data block, that is, multiple target data blocks are obtained, and the size of each target data block may be 32×32×256.
  • B3: it can be judged whether the multiple target data blocks and the initial data meet the set block conditions; if so, 2× (i.e., (x+1)×) the minimum granularity corresponding to the second dimension of the target data is used as the updated second target block size, and the process returns to dividing each intermediate data block into two-dimensional blocks along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set block conditions, at which point x× the minimum granularity corresponding to the second dimension of the target data is determined as the block result.
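Steps B1 to B3 fix the first-dimension block size at y times its minimum granularity and then grow the second-dimension block size until the block conditions fail. A sketch under the same assumptions as before (`fits` is a hypothetical caller-supplied predicate):

```python
def two_dim_block(min_g1, min_g2, second_dim_size, y, fits):
    """Steps B1-B3 sketched: the first-dimension block size is fixed at
    y * min_g1 (y is iterated externally, starting at 1), then the
    second-dimension block size x * min_g2 is grown while the resulting
    target data blocks still satisfy the block conditions.
    Returns (first_size, best_x); best_x is None on failure."""
    first_size = y * min_g1
    x, best = 1, None
    while x * min_g2 <= second_dim_size and fits(first_size, x * min_g2):
        best = x
        x += 1
    return first_size, best

# Toy condition: the block "area" along the two dimensions must stay <= 4096.
first, x = two_dim_block(32, 16, 256, 1, lambda a, b: a * b <= 4096)
print(first, x)  # 32 8, since 32 x (8 x 16) = 4096 still fits
```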
  • the minimum granularity of the first dimension can be used as the first target block size, and twice the minimum granularity of the second dimension can be used as the second target block size; based on the first target block size and the second target block size, two-dimensional blocking is performed on the target data corresponding to the target operator of the network layer to be processed.
  • when the parameter data corresponding to the network layer to be processed also includes output data, determining that the multiple target data blocks and the initial data meet the set block conditions includes: determining that the initial data, the output data, and each target data block respectively meet the memory requirements of the computing device and, when the initial data, the output data, and each target data block also respectively meet the DMA transfer requirements in the computing device, determining that the multiple target data blocks and the initial data satisfy the set block conditions.
  • the memory requirements of the computing device may be set according to user requirements and/or computing device requirements. For example, it can be determined whether the total data capacity of the initial data, output data, and each target data block is less than or equal to the set memory capacity of the computing device, and if so, it is determined to meet the memory requirements of the computing device.
  • determine whether the data capacity of the initial data is less than or equal to the first local memory capacity allocated for the initial data on the memory of the computing device, and determine whether the data capacity of the output data is less than or equal to the local memory capacity allocated for the output data on the memory of the computing device.
  • dedicated memory and public memory can also be set. If the constant data is set to be stored in the public memory, and the input data and output data are stored in the dedicated memory, it can be determined whether the initial data, the output data, and each target data block meet the memory requirements of the corresponding dedicated memory and public memory; if so, it is determined that the memory requirements of the computing device are met. That is, when the initial data is the input data and the target data blocks are the target data blocks corresponding to the constant data, it is judged whether the data capacity of the initial data and the output data is less than or equal to the set memory capacity of the dedicated memory, and whether each target data block is less than or equal to the set memory capacity of the public memory; if all conditions are satisfied, it is determined that the memory requirements of the computing device are satisfied.
  • after each target data block is determined, an allocation attempt is performed for the target data block, the initial data, and the output data. If the allocation attempt succeeds, it is determined that the initial data, the output data, and each target data block satisfy the memory requirements of the computing device.
  • the DMA transmission requirements can be determined according to actual needs. For example, if the total data capacity of the initial data, the output data, and each target data block is less than or equal to the data capacity that can be transferred by DMA, that is, when it is determined that the DMA task can be established successfully, it is determined that the DMA transfer requirements of the computing device are met.
  • when the initial data, the output data, and each target data block meet the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data satisfy the set block conditions.
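The set block conditions above (memory fit plus DMA transferability) can be summarized as a single predicate. The flat-capacity model and all names below are assumptions of this sketch, not the patent's concrete interface:

```python
def meets_block_conditions(initial_cap, output_cap, block_caps,
                           device_mem_cap, dma_max_cap):
    """Hypothetical check of the 'set block conditions': the initial data,
    the output data and all target data blocks must fit the device memory
    and be transferable by DMA (all capacities in bytes)."""
    total = initial_cap + output_cap + sum(block_caps)
    fits_memory = total <= device_mem_cap
    fits_dma = total <= dma_max_cap  # i.e. the DMA task can be established
    return fits_memory and fits_dma
```

A blocking scheduler would typically retry this predicate with progressively smaller target data blocks until it holds.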
  • after determining the target operator and the target block strategy corresponding to each network layer to be processed in the target neural network, the target neural network including the target operators can be run based on the target block strategies respectively corresponding to the network layers to be processed.
  • the image to be processed can be input into the target neural network, and the computing device uses the target block strategy and target operator corresponding to each network layer to be processed to perform feature extraction on the image to be processed and determine the detection result corresponding to the image.
  • the detection result may be the category of the target object included in the image to be processed, the position information of the target object, the contour information of the target object, and the like.
  • the memory of the computing device is divided into an initial data area, a scheduling data area ping, a scheduling data area pong, an output data area ping, and an output data area pong.
  • when the initial data is input data, the scheduling data is constant data; when the initial data is constant data, the scheduling data is input data.
  • an output ping (i.e., output data ping) is generated and placed in the memory area corresponding to the output data area ping of the computing device; the output ping is then read from that memory area through DMA and transmitted to the corresponding external memory (such as DDR).
  • the computing device then processes the received scheduling block while the DMA transmits the next scheduling block into the other scheduling data area (ping or pong), and the above process repeats until the parameter data of the network layer to be processed has been completely processed.
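The ping/pong schedule can be modeled sequentially as follows. This is a sketch only: on real hardware the DMA prefetch and the computation overlap in time, and `process` stands in for the device's per-block computation:

```python
def run_layer_ping_pong(schedule_blocks, process):
    """Sequential model of the ping/pong schedule: while the device
    processes the block in one buffer, the DMA fills the other buffer
    with the next scheduling block."""
    buffers = [None, None]          # scheduling data area ping / pong
    outputs = []
    if not schedule_blocks:
        return outputs
    buffers[0] = schedule_blocks[0]            # first DMA transfer-in
    for i in range(len(schedule_blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(schedule_blocks):       # prefetch next block (DMA)
            buffers[nxt] = schedule_blocks[i + 1]
        outputs.append(process(buffers[cur]))  # compute on current buffer
    return outputs                             # each output is DMA'd to DDR
```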
  • the writing order of the steps does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • an embodiment of the present disclosure also provides a neural network operating apparatus.
  • a schematic diagram of the architecture of the neural network operating apparatus provided by the embodiment of the present disclosure includes a first determining module 501, a second determining module 502, and a running module 503; specifically:
  • the first determination module 501 is used to determine the network layer to be processed in the target neural network
  • the second determination module 502 is configured to determine, from the determined multiple operators and multiple block strategies, the target operator and the target block strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple block strategies matches the operating requirements of the computing device used to run the target neural network;
  • a running module 503 is configured to run the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed.
  • the block strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed, such that the resource consumption of running the network layer to be processed on the blocked parameter data is minimal.
  • the second determination module 502, when determining, from the determined multiple operators and multiple block strategies, the target operator and target block strategy corresponding to the network layer to be processed in the target neural network, is used for:
  • for each network layer to be processed, determining a target candidate operator corresponding to the to-be-processed network layer from the plurality of operators, and determining, from the multiple block strategies, a target candidate block strategy matching the target candidate operator;
  • based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed, determining the target operator and the target block strategy corresponding to each network layer to be processed.
  • the second determination module 502, when determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operator and target candidate block strategy corresponding to each network layer to be processed, is used to:
  • determining a plurality of test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate block strategies corresponding to those target candidate operators; wherein each test network includes one target candidate operator corresponding to each of the network layers to be processed and one target candidate block strategy matching that target candidate operator;
  • determining the target candidate operator and target candidate block strategy of each network layer to be processed in the target test network as the target operator and the target block strategy corresponding to that network layer to be processed in the target neural network, respectively.
  • the second determination module 502, for each network layer to be processed in the target neural network, when determining the target candidate operator corresponding to the network layer to be processed from the plurality of operators and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies, is used for:
  • when the restricted scenario is a computation-limited scenario, determining, based on the block result under the block strategy, the computation time of the parameter data corresponding to the first candidate operator, the number of operator calls of the first candidate operator, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead; and determining the computational overhead value of the first candidate operator under the block strategy based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
  • the second determining module 502, when selecting, from the first candidate operators and the multiple block strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate block strategies corresponding to the target candidate operators, is used to:
  • selecting, from the multiple resource consumption situations corresponding to the first candidate operator, a target resource consumption situation that satisfies a preset condition; wherein one first candidate operator corresponds to one resource consumption situation under one block strategy;
  • determining one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators.
  • the alignment module 504 is configured to perform an alignment operation on the parameter data corresponding to the first candidate operator based on the determined minimum granularity information corresponding to the target neural network, to obtain the aligned parameter data corresponding to the first candidate operator;
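The alignment operation can be modeled as rounding each dimension of the parameter data up to a multiple of its minimum granularity. The helper names below are hypothetical, not the patent's interface:

```python
def align_up(size, min_granularity):
    """Round `size` up to the nearest multiple of the minimum granularity
    (a hypothetical model of the alignment performed by module 504)."""
    return -(-size // min_granularity) * min_granularity

def align_shape(shape, granularities):
    """Align every dimension of the parameter data's shape to the
    corresponding minimum granularity."""
    return tuple(align_up(s, g) for s, g in zip(shape, granularities))
```

For example, a 100-element dimension with a minimum granularity of 16 would be padded to 112 elements before blocking.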
  • the input data is divided into blocks of a specified dimension to obtain a block result;
  • the constant data is divided into blocks of a specified dimension to obtain a block result; wherein the target size of the part of the input data is determined according to the minimum granularity of the first dimension of the input data;
  • the input data is divided into blocks of a specified dimension to obtain a block result; wherein the target size of the part of the constant data is determined according to the minimum granularity of the first dimension of the constant data.
  • the constant data is divided into blocks of a specified dimension to obtain a block result, including:
  • each part of the input data of the target size is respectively used as initial data, and, based on the determined dimension parameter of the constant data, the constant data is divided into blocks of a specified dimension to obtain a block result;
  • i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data and the data capacity of the constant data block determined based on the minimum granularity of the dimension parameter of the constant data satisfy the memory requirement of the computing device.
  • j is a positive integer such that, after the target size of the partial constant data is determined, the data capacity of the partial constant data and the data capacity of the input data block determined based on the minimum granularity of the dimension parameter of the input data satisfy the memory requirement of the computing device.
  • in the case where the specified dimension is one-dimensional and the dimension parameter includes the first dimension,
  • the constant data and the input data are respectively used as target data, and one-dimensional blocking is performed on the target data based on the determined first dimension of the target data to obtain a one-dimensional block result, including:
  • when the block condition is satisfied, determining k times the minimum granularity corresponding to the first dimension of the target data as the block result;
  • otherwise, the block result is a one-dimensional block failure.
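The one-dimensional blocking described above can be sketched as a search for the largest multiple k of the minimum granularity whose block still satisfies the block condition; `block_ok` below stands in for the memory/DMA block-condition check and is an assumption of this sketch:

```python
def one_dim_block(dim_size, min_granularity, block_ok):
    """Try block sizes k * min_granularity (largest k first) along the
    first dimension of the target data; return the chosen block size, or
    None when one-dimensional blocking fails."""
    max_k = dim_size // min_granularity
    for k in range(max_k, 0, -1):
        size = k * min_granularity
        if block_ok(size):        # the set block conditions hold
            return size
    return None                   # one-dimensional block failure
```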
  • in the case where the specified dimension is two-dimensional and the dimension parameter includes a second dimension, the constant data and the input data are respectively used as target data, and two-dimensional blocking is performed on the target data based on the determined first dimension and second dimension of the target data to obtain a two-dimensional block result, including:
  • the functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure; the electronic device includes a processor 601, a memory 602, and a bus 603.
  • the memory 602 is used to store execution instructions and includes an internal memory 6021 and an external memory 6022; the internal memory 6021 is used to temporarily store operation data of the processor 601 and data exchanged with the external memory 6022 such as a hard disk; the processor 601 exchanges data with the external memory 6022 through the internal memory 6021.
  • the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
  • the target neural network including the target operator is run based on the target block strategy corresponding to the to-be-processed network layer.
  • an embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the neural network operation method described in the foregoing method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure further provide a computer program product, where the computer program product carries program codes, and the instructions included in the program codes can be used to execute the steps of the neural network operation method described in the foregoing method embodiments.
  • for details, refer to the foregoing method embodiments, which are not repeated here.
  • the above-mentioned computer program product can be specifically implemented by means of hardware, software or a combination thereof.
  • in one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and the like.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-executable non-volatile computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Neurology (AREA)
  • Quality & Reliability (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Image Analysis (AREA)

Abstract

A neural network operation method and apparatus, an electronic device, and a storage medium. The method comprises: determining a network layer to be processed of a target neural network (S101); determining, from determined multiple operators and multiple chunking policies, a target operator and a target chunking policy respectively corresponding to the network layer to be processed of the target neural network (S102), each of the multiple operators being used for implementing a function corresponding to the network layer to be processed, and each of the multiple chunking policies matching an operating requirement of a computing device for operating the target neural network; and on the basis of the target chunking policy corresponding to the network layer to be processed, operating the target neural network comprising the target operator (S103).

Description

Neural network operation method, apparatus, electronic device and storage medium

TECHNICAL FIELD
The present disclosure relates to the technical field of deep learning, and in particular, to a neural network operation method and apparatus, an electronic device, and a storage medium.
BACKGROUND ART
With the development of technology, large-scale neural networks have been applied in various scenarios, such as autonomous driving and image recognition. After a large neural network is constructed, it can be run by a computing device.
SUMMARY OF THE INVENTION
In view of this, the present disclosure provides at least a neural network operation method and apparatus, an electronic device, and a storage medium.
In a first aspect, the present disclosure provides a neural network operation method, including: determining a network layer to be processed in a target neural network; determining, from determined multiple operators and multiple block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple block strategies matches the operating requirements of the computing device used to run the target neural network; and running the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed.
In the above method, after the network layer to be processed in the target neural network is determined, the target operator and the target block strategy corresponding to the network layer to be processed can be determined from the determined multiple operators and multiple block strategies. Since the block strategy meets the operating requirements of the computing device, running the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed can satisfy those operating requirements. At the same time, since the target block strategy blocks the parameter data of the target operator corresponding to the matching network layer to be processed, the resource consumption of running the network layer to be processed on the blocked parameter data is minimized; this resource consumption can be characterized, for example, by the total computational overhead. That is, while the operating requirements of the computing device are satisfied, the target neural network including the target operators can be run efficiently based on the target block strategies respectively corresponding to at least one network layer to be processed.
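The per-layer selection described above can be sketched as a small search over candidate (operator, block strategy) pairs. `meets_device` and `cost` below are assumed callbacks standing in for the device-requirement check and the resource-consumption estimate; all names are hypothetical:

```python
def select_per_layer(layers, operators, strategies, meets_device, cost):
    """For each to-be-processed layer, pick the (operator, block strategy)
    pair that satisfies the computing device's operating requirements and
    minimizes the resource-consumption estimate `cost`."""
    chosen = {}
    for layer in layers:
        candidates = [(op, st)
                      for op in operators[layer]
                      for st in strategies
                      if meets_device(layer, op, st)]
        if candidates:
            chosen[layer] = min(candidates,
                                key=lambda c: cost(layer, c[0], c[1]))
    return chosen
```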
In a possible implementation manner, the block strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed; among the multiple block strategies, running the network layer to be processed on the parameter data obtained by blocking the parameter data of the target operator with the target block strategy consumes the least resources.
In a possible implementation manner, when there are multiple network layers to be processed, determining the target operator and target block strategy corresponding to each network layer to be processed in the target neural network from the determined multiple operators and multiple block strategies includes: for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators, and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies; and, when any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate block strategies, determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed.
In the above embodiment, the target candidate operator corresponding to each network layer to be processed and the target candidate block strategy matching that operator can first be determined separately, realizing a local optimization of the target candidate operator and target candidate block strategy for each network layer to be processed. Further, when any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate block strategies, the target operator and target block strategy corresponding to each network layer to be processed are determined based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed, realizing a global optimization of the target candidate operators and target candidate block strategies across the network layers to be processed.
In a possible implementation manner, determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed includes: determining multiple test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate block strategies corresponding to those target candidate operators, wherein each test network includes one target candidate operator corresponding to each network layer to be processed and one target candidate block strategy matching that target candidate operator; running the multiple test networks respectively to obtain multiple test results, wherein each test network corresponds to one test result; selecting a target test network from the multiple test networks based on the multiple test results; and determining the target candidate operator and target candidate block strategy of each network layer to be processed in the target test network as the target operator and target block strategy corresponding to that network layer to be processed in the target neural network.
In the above embodiment, multiple test networks corresponding to the target neural network are determined based on the target candidate operator corresponding to each network layer to be processed and the target candidate block strategy corresponding to that operator; the computing device then runs the multiple test networks to determine the test result of each test network, and the target test network is determined based on the test results. For example, when the test result is the computational overhead, the test network with the least computational overhead can be selected as the target test network, and the target candidate operators and target candidate block strategies of the network layers to be processed in the target test network are determined as the target operators and target block strategies respectively corresponding to the network layers to be processed in the target neural network, realizing a global optimization of the target operators and target block strategies.
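The test-network enumeration can be sketched as a Cartesian product over per-layer candidates. `run_test_network` below is an assumed measurement callback (e.g. returning the total computational overhead of one test network, lower being better), and all names are hypothetical:

```python
from itertools import product

def pick_target_network(per_layer_candidates, run_test_network):
    """Enumerate test networks (one candidate (operator, strategy) pair
    per layer), measure each with `run_test_network`, and return the
    best per-layer assignment together with its score."""
    layers = list(per_layer_candidates)
    best, best_score = None, float("inf")
    for combo in product(*(per_layer_candidates[l] for l in layers)):
        assignment = dict(zip(layers, combo))   # one test network
        score = run_test_network(assignment)
        if score < best_score:
            best, best_score = assignment, score
    return best, best_score
```

Note the combinatorial cost: the number of test networks grows as the product of per-layer candidate counts, which is why the local pre-selection of candidates described above matters.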
In a possible implementation manner, for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies includes: for the network layer to be processed, determining one or more first candidate operators from the multiple operators; and, based on the resource consumption of each first candidate operator under each of the multiple block strategies, selecting, from the first candidate operators and the multiple block strategies, one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators.
Here, for each network layer to be processed, after the one or more first candidate operators corresponding to the network layer are determined, one or more target candidate operators corresponding to the network layer and the target candidate block strategies corresponding to those operators can be selected from the first candidate operators and the multiple block strategies based on the resource consumption of each first candidate operator under each block strategy. For example, the first candidate operator and block strategy with the least resource consumption can be selected as the target candidate operator and target candidate block strategy, realizing a local optimization of the target candidate operator and target candidate block strategy for each network layer to be processed.
In a possible implementation manner, the resource consumption is represented by a computational overhead value, and the computational overhead value of the first candidate operator under each block strategy is determined according to the following steps: determining the restricted scenario corresponding to the first candidate operator under a preset size, wherein the restricted scenario is determined based on the computation time and transmission time of the data capacity corresponding to the first candidate operator under the preset size; when the restricted scenario is a bandwidth-limited scenario, determining, based on the block result under the block strategy, the total amount of direct memory access (DMA) data transmission, the number of DMA tasks, and the data conversion overhead corresponding to the first candidate operator under the block strategy, and determining the computational overhead value of the first candidate operator under the block strategy based on the total amount of DMA data transmission, the number of DMA tasks, the data conversion overhead, and the DMA rate and DMA task overhead corresponding to the computing device, wherein the data conversion overhead is the time consumed to convert the data arrangement of the input data corresponding to the first candidate operator into the target data arrangement corresponding to the first candidate operator; and, when the restricted scenario is a computation-limited scenario, determining, based on the block result under the block strategy, the computation time of the parameter data corresponding to the first candidate operator under the block strategy, the number of operator calls of the first candidate operator, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead, and determining the computational overhead value of the first candidate operator under the block strategy based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
In the above embodiment, the restricted scenario corresponding to the first candidate operator under the preset size can be determined, and different restricted scenarios correspond to different ways of determining the computational overhead value. For example, in a bandwidth-limited scenario, the overhead value can be determined based on the total amount of DMA data transmission, the number of DMA tasks, the data conversion overhead, the DMA rate, and the DMA task overhead; in a computation-limited scenario, the overhead value can be determined based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate.
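As a rough sketch, the two overhead formulas implied above might look as follows. The text lists the inputs but not how they are combined, so the additive model and all parameter names here are assumptions of this sketch:

```python
def overhead_bandwidth_limited(dma_bytes_total, dma_task_count,
                               convert_time, dma_rate, dma_task_overhead):
    """Bandwidth-limited scenario: overhead assumed dominated by DMA
    transfer time plus per-task setup cost and data-layout conversion."""
    return (dma_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_time)

def overhead_compute_limited(compute_time, call_count, init_bytes_total,
                             convert_time, dma_rate, dma_task_count,
                             dma_task_overhead):
    """Computation-limited scenario: per-call compute time assumed to
    dominate, with initial data transfer, DMA task setup and conversion
    time added on."""
    return (compute_time * call_count
            + init_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_time)
```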
In a possible implementation, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate blocking strategies corresponding to the target candidate operators, based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, includes: selecting, from the multiple resource consumption cases corresponding to the first candidate operators, a target resource consumption case that satisfies a preset condition, where one first candidate operator under one blocking strategy corresponds to one resource consumption case; determining the blocking strategy corresponding to the target resource consumption case as a candidate blocking strategy, and running, based on the candidate blocking strategy, the network layer to be processed containing the second candidate operator corresponding to the target resource consumption case, to determine a test result corresponding to the candidate blocking strategy and the second candidate operator; and determining, based on the test result, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
With the above method, the resource consumption can first be used to select, from the first candidate operators and the multiple blocking strategies, a second candidate operator and a candidate blocking strategy matching the second candidate operator; the second candidate operator and the candidate blocking strategy are then tested, and the test result is used to determine at least one target candidate operator and at least one target candidate blocking strategy corresponding to the network layer to be processed, so that the determined target candidate operators and target candidate blocking strategies are comparatively good choices.
In a possible implementation, before selecting, from the first candidate operators and the multiple blocking strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators, the method further includes: performing, based on determined minimum granularity information corresponding to the target neural network, an alignment operation on the parameter data corresponding to the first candidate operator, to obtain aligned parameter data corresponding to the first candidate operator, where the minimum granularity information includes the minimum granularity corresponding to the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity, indicated by the minimum granularity information, in the corresponding dimension.
Here, based on the minimum granularity information corresponding to the target neural network, an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data corresponding to that first candidate operator; the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity, indicated by the minimum granularity information, in the corresponding dimension, which reduces the probability of parameter data loss when the target neural network is subsequently run based on the target blocking strategy.
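The alignment step above can be sketched as follows: each dimension size is rounded up to the next integer multiple of that dimension's minimum granularity. This is a minimal sketch; the disclosure only requires that the aligned sizes be integer multiples, so the rounding-up (zero-padding) interpretation is an assumption.

```python
# Minimal sketch of the alignment operation: round the size of each dimension
# of the parameter data up to the next integer multiple of that dimension's
# minimum granularity, so that no block straddles a granularity boundary.

def aligned_shape(shape, min_granularity):
    """Round each dimension size up to a multiple of its minimum granularity."""
    return tuple(-(-size // gran) * gran          # ceiling division, then scale
                 for size, gran in zip(shape, min_granularity))

# Example: weight data of shape (3, 3, 250, 60) with per-dimension
# granularities (1, 1, 32, 16) is aligned to (3, 3, 256, 64).
shape = aligned_shape((3, 3, 250, 60), (1, 1, 32, 16))
```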
In a possible implementation, in a case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following: taking all of the input data as initial data, and blocking the constant data in a specified dimension based on a determined dimension parameter of the constant data, to obtain a blocking result, where the initial data is the data written into the initial data region allocated to direct memory access (DMA) tasks when the computing device runs the target neural network; taking all of the constant data as the initial data, and blocking the input data in a specified dimension based on a determined dimension parameter of the input data, to obtain a blocking result; taking part of the input data as the initial data, and blocking the constant data in a specified dimension based on the determined dimension parameter of the constant data, to obtain a blocking result, where the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data; and taking part of the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameter of the input data, to obtain a blocking result, where the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
In a possible implementation, taking part of the input data as the initial data and blocking the constant data in a specified dimension based on the determined dimension parameter of the constant data to obtain a blocking result includes: determining the target size of the partial input data based on i times the minimum granularity of the first dimension of the input data; and taking the partial input data of the target size as the initial data, and blocking the constant data in the specified dimension based on the determined dimension parameter of the constant data, to obtain the blocking result, where i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data, and the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameter of the constant data, satisfy the memory requirements of the computing device.
In a possible implementation, taking part of the constant data as the initial data and blocking the input data in a specified dimension based on the determined dimension parameter of the input data to obtain a blocking result includes: determining the target size of the partial constant data based on j times the minimum granularity of the first dimension of the constant data; and taking the partial constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameter of the input data, to obtain the blocking result, where j is a positive integer such that, after the target size of the partial constant data is determined, the data capacity of the partial constant data, and the data capacity of the input data blocks determined based on the minimum granularity of the dimension parameter of the input data, satisfy the memory requirements of the computing device.
Here, setting multiple blocking strategies enables a better target operator, and a target blocking strategy matching that target operator, to be selected for each network layer to be processed.
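For the "partial input data" and "partial constant data" strategies above, the multipliers i and j can, for example, be found by a simple search over positive integers. The sketch below is an assumption-laden illustration: the function name, the single memory budget, and the "largest i that fits" choice are not stated in the disclosure, which only requires i (or j) to be a positive integer whose resulting capacities satisfy the device's memory requirements.

```python
# Hypothetical sketch of choosing the multiplier i (or j): the target size of
# the partial data is i times the minimum granularity of its first dimension,
# and i must keep the partial data plus one block of the other tensor within
# the computing device's memory budget. All names are illustrative.

def choose_multiplier(granularity_bytes, other_block_bytes, mem_budget_bytes):
    """Largest positive i such that both buffers fit; None if i = 1 does not."""
    i, best = 1, None
    while i * granularity_bytes + other_block_bytes <= mem_budget_bytes:
        best = i
        i += 1
    return best
```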
In a possible implementation, in a case where the specified dimension is one-dimensional and the dimension parameter includes a first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain a blocking result includes: determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size, and performing, based on the target block size, one-dimensional blocking on the target data along the first dimension to obtain multiple target data blocks corresponding to the target data, where k is a positive integer; in a case where it is determined that the multiple target data blocks and the initial data satisfy a set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as an updated target block size, and returning to the step of performing one-dimensional blocking on the target data along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result; and, in a case where the initial data and the multiple target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
With the above method, the target block size is increased step by step, and the blocking result that yields higher memory utilization on the computing device is determined through repeated trials, which helps reduce the waste of the computing device's memory resources.
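The trial-and-error loop described above can be sketched as follows. The `fits` predicate stands in for the set blocking condition (the memory and DMA checks) and is an assumption for illustration; the loop itself follows the steps in this implementation: start at k = 1, keep enlarging the block size by one granularity step while the condition holds, and report failure if even k = 1 does not satisfy it.

```python
# Minimal sketch of the 1-D blocking search described above.

def split_1d(total, block):
    """Split a dimension of length `total` into chunks of at most `block`."""
    return [min(block, total - start) for start in range(0, total, block)]

def best_1d_block(total, min_gran, fits):
    """Largest k*min_gran whose blocks satisfy `fits`; None = blocking failed."""
    k, best = 1, None
    while True:
        blocks = split_1d(total, k * min_gran)
        if not fits(blocks):
            break                  # condition no longer holds
        best = k * min_gran        # remember the last size that worked
        k += 1
    return best

# Example: assume the device can hold at most 96 elements per block.
size = best_1d_block(256, 32, lambda blocks: max(blocks) <= 96)
```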
In a possible implementation, in a case where the specified dimension is two-dimensional and the dimension parameter includes a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain a blocking result includes: determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size, and performing, based on the first target block size, one-dimensional blocking on the target data along the first dimension to obtain multiple intermediate data blocks corresponding to the target data, where y is a positive integer; determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size, and performing, based on the second target block size, two-dimensional blocking on each intermediate data block along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer; and, in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as an updated second target block size, and returning to the step of performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
In a possible implementation, in a case where the parameter data corresponding to the network layer to be processed further includes output data, determining that the multiple target data blocks and the initial data satisfy the set blocking condition includes: determining that the multiple target data blocks and the initial data satisfy the set blocking condition in a case where it is determined that the initial data, the output data, and each target data block each satisfy the memory requirements of the computing device, and that the initial data, the output data, and each target data block each satisfy the DMA transfer requirements of the computing device.
With the above method, when the initial data, the output data, and each target data block satisfy the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, which ensures that the blocking strategy matches the operating requirements of the computing device.
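The blocking condition of this implementation can be sketched as a single predicate over the buffers involved. The two limit constants below are illustrative assumptions; the disclosure only requires that each buffer satisfy the device's memory requirement and its DMA transfer requirement.

```python
# Sketch of the set blocking condition: the initial data, the output data,
# and every target data block must each satisfy both the device memory
# requirement and the DMA transfer requirement. Limit values are assumed.

MEM_LIMIT_BYTES = 512 * 1024      # assumed per-buffer memory budget
DMA_MAX_BYTES = 256 * 1024        # assumed per-transfer DMA limit

def satisfies_blocking_condition(initial_bytes, output_bytes, block_sizes):
    buffers = [initial_bytes, output_bytes, *block_sizes]
    return all(b <= MEM_LIMIT_BYTES and b <= DMA_MAX_BYTES for b in buffers)
```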
For descriptions of the effects of the following apparatus, electronic device, and the like, reference may be made to the description of the above method, which is not repeated here.
In a second aspect, the present disclosure provides a neural network operation apparatus, including: a first determination module, configured to determine a network layer to be processed in a target neural network; a second determination module, configured to determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network, where each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network; and an operation module, configured to run the target neural network containing the target operator based on the target blocking strategy corresponding to each network layer to be processed.
In a third aspect, the present disclosure provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, the steps of the neural network operation method according to the first aspect or any one of its implementations are performed.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program that, when run by a processor, performs the steps of the neural network operation method according to the first aspect or any one of its implementations.
In a fifth aspect, the present disclosure provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device performs the above method.
To make the above objects, features, and advantages of the present disclosure more apparent and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Description of the Drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a neural network operation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of determining the target operator and target blocking strategy corresponding to a network layer to be processed in a target neural network, in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of determining, from multiple operators, the target candidate operators corresponding to a network layer to be processed, and determining, from multiple blocking strategies, the target candidate blocking strategies matching the target candidate operators, in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the software and hardware scheduling of a computing device in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic architecture diagram of a neural network operation apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the claimed scope of the present disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
Generally, for a computing device that relies on direct memory access (DMA) for data transmission, the data cache of the computing device is inefficient or absent. Therefore, when such a computing device is used to run inference for a large neural network, problems such as tiling and scheduling of the single-layer tasks of the large neural network may be encountered due to the limited memory of the computing device.
In a specific implementation, when the computing device runs the large neural network, the official inference library provided by the manufacturer of the computing device can be used to run the large neural network on the computing device. However, the official inference library targets a specific base neural network; after a user optimizes the base neural network, the official inference library may become unusable, or the computing device may run the optimized neural network with the official inference library at low efficiency. The official inference library is an available inference deployment solution; for example, it may be the cdnn library of a ceva dsp.
Further, for the optimized neural network, secondary development of the official inference library can be carried out so that the developed inference library is applicable to the optimized neural network. However, the development process has high cost and low efficiency, and the developed inference library is applicable only to that optimized neural network and not to other neural networks, so the reuse rate of the developed inference library is low.
Therefore, to solve the above problems, embodiments of the present disclosure provide a neural network operation method and apparatus, an electronic device, and a storage medium.
The defects of the above solutions are results obtained by the inventor after practice and careful study. Therefore, the process of discovering the above problems, and the solutions to them proposed below in the present disclosure, should all be regarded as the inventor's contributions to the present disclosure in the course of this disclosure.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
To facilitate understanding of the embodiments of the present disclosure, a neural network operation method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the neural network operation method provided by the embodiments of the present disclosure is generally a computer device with certain computing capability; the computer device may be the computing device that runs the neural network, or another computing device, and includes, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the neural network operation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a schematic flowchart of a neural network operation method provided by an embodiment of the present disclosure, the method includes S101 to S103:
S101: Determine a network layer to be processed in a target neural network.
S102: Determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network.
Each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network.
S103: Run the target neural network containing the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In the above method, after the network layer to be processed in the target neural network is determined, the target operator and target blocking strategy corresponding to the network layer to be processed can be determined from the multiple determined operators and multiple blocking strategies. Because each blocking strategy satisfies the operating requirements of the computing device, running the target neural network containing the target operator based on the target blocking strategy corresponding to the network layer to be processed can satisfy the operating requirements of the computing device. At the same time, because the target blocking strategy can block the parameter data of the target operator corresponding to the matching network layer to be processed, running the network layer to be processed based on the blocked parameter data consumes the least resources; this resource consumption can be characterized, for example, by the total computation overhead. That is, while the operating requirements of the computing device are satisfied, running the target neural network containing the target operators based on the target blocking strategies respectively corresponding to at least one network layer to be processed is relatively efficient.
In a possible implementation, the target neural network is used to implement an image processing task, and the image processing task includes at least one of image recognition, image classification, image segmentation, and key point detection.
S101 to S103 are described in detail below.
Regarding S101: Here, the target neural network can be any neural network that has undergone graph-level optimization (that is, graph optimization). A graph-optimized neural network is a neural network whose computation graph has been determined, that is, a neural network in which the task and parameter sizes of each network layer have been determined; the parameter size of each network layer may be the size of the parameter data included in that network layer. For example, the task of the first network layer of the target neural network may be feature extraction, and when the first network layer includes input data, the parameter size of the input data may be 256×256×128. The task and parameter sizes of each network layer can be set according to the actual situation; this is only an exemplary description.
The network layer to be processed may be a network operation (OP) layer to be processed in the target neural network. For example, the network layer to be processed may be a network OP layer in the target neural network whose size is greater than a set threshold, or it may be a network OP layer selected by a user as required. The number of determined network layers to be processed may be one or more.
Exemplarily, each network OP layer can be approximated as a convolutional layer; for example, a fully connected layer can be approximated as a convolutional layer whose kernel size matches the feature map, and a conventional layer without weights can be treated as a convolutional layer whose weights are 0, and so on.
Regarding S102: Here, when there are multiple network layers to be processed, the target operator and target blocking strategy corresponding to each network layer to be processed can be determined for that layer. Each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network; each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each operator can correspond to one operation or one basic network structure unit. For example, the preset operators include a convolution operator, a pooling operator, a fully connected operator, and the like. The computing device is the device that directly processes the inference computation of the target neural network; for example, the computing device may be a digital signal processing (DSP) device or the like.
In a possible implementation, the blocking strategy is used to partition the parameter data of the target operator corresponding to the to-be-processed network layer; among the multiple blocking strategies, running the to-be-processed network layer on parameter data partitioned with the target blocking strategy consumes the fewest resources.
Here, the minimum resource consumption may refer to the minimum running time of the to-be-processed network layer. In specific implementation, the target blocking strategy of each to-be-processed network layer is used to partition the parameter data of the target operator corresponding to that layer, so that the computing device, running on the partitioned parameter data, consumes the fewest resources for each to-be-processed network layer. For example, the resource consumption may be characterized by the total computational cost, i.e., the total computational cost of running all to-be-processed network layers is minimized. The parameter data of an operator may include input/output data and constant data; the input/output data may include input data and output data, and the constant data may include weight data and/or bias data.
Exemplarily, the input data may be three-dimensional, e.g., including a width dimension, a height dimension, and an input-channel dimension; the output data may be three-dimensional, e.g., including an output-width dimension, an output-height dimension, and an output-channel dimension; the weight data may be four-dimensional, e.g., including a width dimension, a height dimension, an input-channel dimension, and an output-channel dimension; and the bias data may be one-dimensional, e.g., including an output-channel dimension. The dimension information of the above input data, output data, weight data, and bias data may be set according to the actual situation; the above is only an exemplary description.
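As a hedged illustration of the dimension layout above, the parameter-data shapes of a convolution operator could be written as follows; the concrete sizes are hypothetical placeholders, not values from this application.

```python
# Hypothetical sizes for a convolution operator's parameter data; only the
# dimension layout follows the description above, the numbers do not.
H, W, C_in = 32, 32, 16          # input: height, width, input channels
K_h, K_w, C_out = 3, 3, 64       # kernel height/width, output channels

input_shape = (H, W, C_in)               # three-dimensional input data
output_shape = (H, W, C_out)             # three-dimensional output data (spatial size kept for illustration)
weight_shape = (K_h, K_w, C_in, C_out)   # four-dimensional weight data
bias_shape = (C_out,)                    # one-dimensional bias data
```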
In specific implementation, when there are multiple to-be-processed network layers, the target operator and target candidate blocking strategy of each to-be-processed network layer may be determined layer by layer, following the order of the to-be-processed network layers in the target neural network; alternatively, the target operator and target candidate blocking strategy of each to-be-processed network layer may be determined in random order. For example, when it is necessary to determine whether the data arrangement of the input data of the current to-be-processed network layer is consistent with the set target data arrangement, the output data of the to-be-processed network layer preceding the current one is needed; in this case, the target candidate operator and target candidate blocking strategy corresponding to each to-be-processed network layer need to be determined layer by layer.
In an optional implementation, referring to FIG. 2, in the case where there are multiple to-be-processed network layers, determining, from the determined multiple operators and multiple blocking strategies, the target operator and target blocking strategy corresponding to a to-be-processed network layer in the target neural network includes:
S201: for each to-be-processed network layer in the target neural network, determining a target candidate operator corresponding to the to-be-processed network layer from the multiple operators, and determining a target candidate blocking strategy matching the target candidate operator from the multiple blocking strategies.
S202: in the case where any to-be-processed network layer corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, determining the target operator and target blocking strategy corresponding to each to-be-processed network layer based on the target candidate operators and target candidate blocking strategies respectively corresponding to the to-be-processed network layers.
Here, the target candidate operator corresponding to each to-be-processed network layer and the target candidate blocking strategy matching that operator may be determined first, realizing a local selection of the target candidate operator and target candidate blocking strategy for each to-be-processed network layer. Further, in the case where any to-be-processed network layer corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, all to-be-processed network layers are considered jointly to determine the target operator and target blocking strategy respectively corresponding to each to-be-processed network layer in the target neural network, realizing a global selection of the target operators and target blocking strategies of the to-be-processed network layers.
Here, an operator set and a blocking-strategy set may be preset; the operator set includes all the set operators, and the blocking-strategy set includes all the set blocking strategies. To improve the efficiency of determining the target operator and target blocking strategy of a to-be-processed network layer, for each to-be-processed network layer, the multiple operators and multiple blocking strategies corresponding to that layer may be determined from the operator set and the blocking-strategy set. The multiple operators corresponding to different to-be-processed network layers may be the same or different, and the multiple blocking strategies corresponding to different to-be-processed network layers may likewise be the same or different. The multiple operators and multiple blocking strategies corresponding to each to-be-processed network layer may be determined according to the actual situation.
Exemplarily, the multiple operators and/or multiple blocking strategies corresponding to each to-be-processed network layer may be determined based on historical experience data. For example, based on historical experience data, it may be determined that the operators corresponding to to-be-processed network layer one include operator one, operator two, and operator three, with corresponding blocking strategies including blocking strategy one, blocking strategy two, and blocking strategy four; and that the operators corresponding to to-be-processed network layer two include operator one, operator three, operator four, and operator five, with corresponding blocking strategies including blocking strategy two and blocking strategy five.
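The historical-experience example above amounts to a per-layer table of candidate sets; a minimal sketch, with all names being hypothetical placeholders:

```python
# Per-layer candidate operators and blocking strategies drawn from the preset
# operator set and blocking-strategy set; names are placeholders.
candidates = {
    "layer_1": {"operators": ["op_1", "op_2", "op_3"],
                "strategies": ["strategy_1", "strategy_2", "strategy_4"]},
    "layer_2": {"operators": ["op_1", "op_3", "op_4", "op_5"],
                "strategies": ["strategy_2", "strategy_5"]},
}

# Candidate sets of different layers may overlap but need not be identical.
shared_ops = set(candidates["layer_1"]["operators"]) & set(candidates["layer_2"]["operators"])
```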
In S201, in an optional implementation, referring to FIG. 3, for each to-be-processed network layer in the target neural network, determining the target candidate operator corresponding to the to-be-processed network layer from the multiple operators, and determining the target candidate blocking strategy matching the target candidate operator from the multiple blocking strategies, includes:
S301: for the to-be-processed network layer, determining one or more first candidate operators from the multiple operators.
S302: based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer and the target candidate blocking strategies corresponding to the target candidate operators.
Here, for each to-be-processed network layer, after the one or more first candidate operators corresponding to that layer are determined, one or more target candidate operators corresponding to the layer, and the target candidate blocking strategies corresponding to those operators, may be selected from the at least one first candidate operator and the multiple blocking strategies based on the resource consumption of each first candidate operator under each blocking strategy. For example, the first candidate operator and blocking strategy with the least resource consumption may be selected as the target candidate operator and target candidate blocking strategy, realizing a local selection of the target candidate operator and target candidate blocking strategy for each to-be-processed network layer.
For S301, for each to-be-processed network layer in the target neural network, one or more first candidate operators corresponding to that layer may be determined from the multiple operators. For example, according to the task of each to-be-processed network layer, operators capable of completing that task may be selected from the multiple operators as the first candidate operators corresponding to the layer; alternatively, the first candidate operators corresponding to the to-be-processed network layer may be determined by the user based on the requirements of the target neural network.
For S302, the resource consumption of each first candidate operator under each blocking strategy may be determined first; then, based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer, and the target candidate blocking strategies corresponding to those operators, are determined from the at least one candidate operator and the multiple blocking strategies. The resource consumption is the resource-consumption data observed when the computing device runs the first candidate operator based on the blocking strategy; for example, the resource consumption may be characterized by a computational cost value, which characterizes the time the computing device consumes to run the to-be-processed network layer containing the target operator.
Example one: if the first candidate operators corresponding to to-be-processed network layer one include first candidate operator one and first candidate operator two, and the blocking strategies corresponding to the layer include blocking strategy one, blocking strategy two, and blocking strategy three, then for first candidate operator one the computational cost values corresponding to blocking strategy one, blocking strategy two, and blocking strategy three can be calculated, and for first candidate operator two the computational cost values corresponding to blocking strategy one, blocking strategy two, and blocking strategy three can likewise be calculated; the target candidate operator and target candidate blocking strategy corresponding to to-be-processed network layer one can then be determined based on the six calculated cost values.
In an optional implementation, after the multiple computational cost values corresponding to each first candidate operator are obtained, the cost values may be used directly to determine at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer.
For example, at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer may be determined from the cost values in the following two ways:
In way one, from the calculated cost values, the first candidate operator and blocking strategy with the smallest cost value are selected as the target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer.
Continuing with example one above: after the six cost values are obtained, the smallest cost value is selected. For example, if first candidate operator one has the smallest cost value under blocking strategy one, first candidate operator one is determined as the target candidate operator corresponding to to-be-processed network layer one, and blocking strategy one is determined as the target candidate blocking strategy corresponding to to-be-processed network layer one.
In way two, a cost threshold may be set; from the multiple calculated cost values corresponding to the to-be-processed network layer, candidate cost values smaller than the cost threshold are selected, the first candidate operators corresponding to the candidate cost values are determined as the target candidate operators corresponding to the to-be-processed network layer, and the blocking strategies corresponding to the candidate cost values are determined as the target blocking strategies matching those target candidate operators.
Continuing with example one above: after the six cost values are obtained, if the cost value of first candidate operator one under blocking strategy one is smaller than the set cost threshold, and the cost value of first candidate operator two under blocking strategy three is smaller than the set cost threshold, then first candidate operator one is determined as a target candidate operator corresponding to to-be-processed network layer one, with blocking strategy one determined as the target candidate blocking strategy matching first candidate operator one; and first candidate operator two is determined as another target candidate operator corresponding to to-be-processed network layer one, with blocking strategy three determined as the target candidate blocking strategy matching first candidate operator two. The target candidate operators and target candidate blocking strategies corresponding to to-be-processed network layer one are thereby determined.
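The two selection ways above can be sketched as follows; the cost values for 2 candidate operators x 3 blocking strategies are hypothetical, and operator/strategy names are placeholders.

```python
# Hypothetical cost values for 2 candidate operators x 3 blocking strategies.
costs = {
    ("op_1", "strategy_1"): 10, ("op_1", "strategy_2"): 14, ("op_1", "strategy_3"): 18,
    ("op_2", "strategy_1"): 16, ("op_2", "strategy_2"): 20, ("op_2", "strategy_3"): 12,
}

# Way one: keep only the (operator, strategy) pair with the smallest cost.
best_pair = min(costs, key=costs.get)

# Way two: keep every pair whose cost is below a set cost threshold,
# which can yield several target candidates for one layer.
cost_threshold = 15
candidate_pairs = [pair for pair, cost in costs.items() if cost < cost_threshold]
```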
In another implementation, in S302, based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer and one or more target candidate blocking strategies corresponding to the target candidate operators includes:
Step one: selecting, from the multiple resource-consumption records corresponding to the first candidate operators, target resource-consumption records satisfying a preset condition; one first candidate operator corresponds to one resource-consumption record under one blocking strategy.
Step two: determining the blocking strategy corresponding to a target resource-consumption record as a candidate blocking strategy, and, based on the candidate blocking strategy, running the to-be-processed network layer containing the second candidate operator corresponding to that target resource-consumption record, to determine the test result corresponding to the candidate blocking strategy and the second candidate operator.
Step three: determining, based on the test results, one or more target candidate operators corresponding to the to-be-processed network layer and the target candidate blocking strategies corresponding to the target candidate operators.
With the above method, the resource-consumption records may first be used to select, from the first candidate operators and the multiple blocking strategies, second candidate operators and the candidate blocking strategies matching them; the second candidate operators and candidate blocking strategies are then tested, and the test results are used to determine at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer, so that the determined target candidate operators and target candidate blocking strategies are a better choice.
In step one, one first candidate operator corresponds to one resource-consumption record under one blocking strategy. For example, when a first candidate operator corresponds to four blocking strategies, that first candidate operator corresponds to four resource-consumption records.
The following description takes the case where resource consumption is characterized by a computational cost value as an example; the preset condition may be set according to actual needs. For example, the preset condition may be that the cost value is the smallest; and/or that the cost value is smaller than a set cost threshold; and/or that the smallest cost value is selected together with any second-smallest cost value whose difference from the smallest cost value is smaller than a set difference threshold.
For example, if the calculated cost values of a first candidate operator under the multiple set blocking strategies are: cost value one is 10, cost value two is 12, cost value three is 18, and cost value four is 20, then the smallest cost value may be selected from the multiple cost values, i.e., cost value one is determined as a target cost value; alternatively, a cost threshold of 15 may be set, and cost value one and cost value two are determined as target cost values; alternatively, a difference threshold of 5 may be set (the difference between cost value two and cost value one is smaller than the set difference threshold), and cost value one and cost value two are determined as target cost values. Each target cost value corresponds to a second candidate operator and the candidate blocking strategy matching that second candidate operator.
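The three preset conditions in the example above can be sketched directly over the cost values 10, 12, 18, 20:

```python
# Hypothetical cost values of one first candidate operator under four blocking strategies.
costs = [10, 12, 18, 20]

# Condition 1: the smallest cost value only.
by_min = [c for c in costs if c == min(costs)]

# Condition 2: every cost value below a set cost threshold.
cost_threshold = 15
by_threshold = [c for c in costs if c < cost_threshold]

# Condition 3: the smallest value, plus any value whose difference from the
# smallest is below a set difference threshold.
diff_threshold = 5
by_diff = [c for c in costs if c - min(costs) < diff_threshold]
```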
In step two, the second candidate operator and candidate blocking strategy corresponding to each target cost value (i.e., each target resource-consumption record) may be actually measured, obtaining the test result corresponding to each target cost value. That is, for each target cost value, the blocking strategy corresponding to the target resource-consumption record may be determined as a candidate blocking strategy, and, based on the candidate blocking strategy, the to-be-processed network layer containing the second candidate operator corresponding to that target cost value is run to determine the test result corresponding to that target cost value, i.e., the test result corresponding to the candidate blocking strategy and the second candidate operator.
In step three, one or more target candidate operators corresponding to the to-be-processed network layer, and the target candidate blocking strategies corresponding to those operators, may be determined based on the test results. For example, when the test result is a running time, the second candidate operator with the shortest running time may be selected and determined as the target candidate operator of the to-be-processed network layer, and the candidate blocking strategy corresponding to that second candidate operator determined as the target candidate blocking strategy. The first candidate operators and second candidate operators may be operators capable of realizing the function of the to-be-processed network layer.
Alternatively, a running-time threshold may be set; test results smaller than the running-time threshold are determined as target test results, the second candidate operators corresponding to the target test results are determined as target candidate operators, and the candidate blocking strategies corresponding to the target candidate operators in the target test results are determined as target candidate blocking strategies.
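Selection by measured running time, in both variants just described, can be sketched as follows; the (operator, strategy) pairs and millisecond timings are hypothetical.

```python
# Hypothetical measured running times (ms) for candidate (operator, strategy) pairs.
measured = {
    ("op_1", "strategy_1"): 8.0,
    ("op_2", "strategy_3"): 6.5,
    ("op_3", "strategy_2"): 9.2,
}

# Variant 1: the pair with the shortest measured running time wins.
target_pair = min(measured, key=measured.get)

# Variant 2: keep every pair measured below a running-time threshold.
time_threshold = 9.0
target_pairs = [p for p, t in measured.items() if t < time_threshold]
```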
In an optional implementation, when the resource consumption is represented by a computational cost value, the cost value of a first candidate operator under each blocking strategy may be determined according to the following steps:
Step one: determining the restricted scenario corresponding to the first candidate operator at a preset size, where the restricted scenario is determined based on the computation time and transmission time of the data capacity corresponding to the first candidate operator at the preset size.
Step two: in the case where the restricted scenario is a bandwidth-limited scenario, determining, based on the blocking result of partitioning with the blocking strategy, the total direct memory access (DMA) data-transfer amount, the number of DMA tasks, and the data-conversion overhead corresponding to the first candidate operator under the blocking strategy; and determining the cost value of the first candidate operator under the blocking strategy based on the total DMA data-transfer amount, the number of DMA tasks, the data-conversion overhead, and the DMA rate and per-task DMA overhead corresponding to the computing device. The data-conversion overhead is the time consumed to convert the data arrangement of the input data corresponding to the first candidate operator into the target data arrangement corresponding to that operator.
Step three: in the case where the restricted scenario is a computation-limited scenario, determining, based on the blocking result of partitioning with the blocking strategy, the computation time of the parameter data corresponding to the first candidate operator under the blocking strategy, the number of operator invocations of the first candidate operator, the total initial data-transfer amount, the number of DMA tasks, and the data-conversion overhead; and determining the cost value of the first candidate operator under the blocking strategy based on the computation time, the number of operator invocations, the total initial data-transfer amount, the data-conversion overhead, the per-task DMA overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
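The two steps above name the inputs of each cost value but not an exact formula; the linear combinations below are an assumed illustration of how such cost functions could be composed, not the patented computation.

```python
# Hedged sketch: assumed cost functions for the two restricted scenarios.

def bandwidth_limited_cost(dma_total_bytes, dma_task_count, conversion_overhead,
                           dma_rate, per_task_overhead):
    # Cost dominated by moving data: transfer time, per-task DMA setup
    # overhead, and time to convert input data to the target arrangement.
    return (dma_total_bytes / dma_rate
            + dma_task_count * per_task_overhead
            + conversion_overhead)

def computation_limited_cost(compute_time_per_call, call_count, initial_transfer_bytes,
                             conversion_overhead, per_task_overhead, dma_task_count, dma_rate):
    # Cost dominated by computing: total compute time across operator
    # invocations, plus the initial (non-overlapped) transfer, DMA task
    # setup overhead, and data-conversion overhead.
    return (compute_time_per_call * call_count
            + initial_transfer_bytes / dma_rate
            + dma_task_count * per_task_overhead
            + conversion_overhead)
```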
In step one, the restricted scenario corresponding to each first candidate operator at the preset size may be determined. The preset size may be a relatively large size set as required.
In specific implementation, the parameter data of the target operator corresponding to each to-be-processed network layer may be stored in external memory other than the memory of the computing device, e.g., in double-data-rate synchronous dynamic random-access memory (Double Data Rate, DDR). When each to-be-processed network layer is run, the DMA may fetch the parameter data (e.g., input data, constant data, etc.) of the target operator corresponding to that layer from the DDR and transfer the fetched parameter data into the memory of the computing device; after the computing device finishes the computation, the DMA transfers the data result (i.e., the output data) back to the DDR, so that the next network layer adjacent to this to-be-processed network layer in the target neural network (which may itself be a to-be-processed network layer) can use it. The DMA may use a ping-pong scheduling strategy to transfer the fetched parameter data.
It follows that there is a transmission time for the DMA to transfer the parameter data of the first candidate operator at the preset size, and a computation time for the computing device to process that parameter data. Accordingly, if the computation time is longer than the transmission time, then by the time the DMA has transferred the current parameter data into the memory of the computing device, the computing device has not yet finished processing the previous parameter data; the current parameter data can only be processed after that processing ends, and the scenario corresponding to this situation is a computation-limited scenario. If the computation time is shorter than the transmission time, then after the computing device finishes processing the previous parameter data, the DMA has not yet transferred the current parameter data into the computing device's memory, and the computing device needs to wait until the current parameter data transmitted by the DMA is received; the scenario corresponding to this situation is a bandwidth-limited scenario.
Accordingly, when the restricted scenario is a bandwidth-limited scenario, the cost value may be calculated using the first cost function corresponding to the bandwidth-limited scenario; when the restricted scenario is a computation-limited scenario, the cost value may be calculated using the second cost function corresponding to the computation-limited scenario.
Exemplarily, the restricted scenario corresponding to the first candidate operator at the preset size may be determined according to the following process: for the first candidate operator at the preset size, determining the transmission time required to transfer the parameter data corresponding to the first candidate operator at the preset size, and the computation time required for the computing device to process that parameter data, and determining the restricted scenario corresponding to the first candidate operator according to the relative magnitudes of the transmission time and the computation time.
Exemplarily, the restricted scenario of the first candidate operator at the preset size may also be determined according to the following process. First, based on the preset size information, determining the target time required for the computing device to run the corresponding to-be-processed network layer on the parameter data corresponding to the first candidate operator, and determining the target data capacity of that parameter data. Second, based on the DMA rate corresponding to the computing device and the target time, determining the data capacity that the DMA can transfer within the target time. Third, determining the restricted scenario based on the ratio of the data capacity the DMA can transfer within the target time to the target data capacity: when the ratio is less than or equal to 1, the restricted scenario is determined to be a bandwidth-limited scenario; when the ratio is greater than 1, the restricted scenario is determined to be a computation-limited scenario.
这里,目标耗时内DMA可传输的数据容量与传输速度相关,目标数据容量与计算速度相关,在比值大于1时,表征传输速度大于计算速度(即传输耗时小于计算耗时),即为计算受限场景;在比值小于或等于1时,表征传输速度小于或等于计算速度(即传输耗时大于或等于计算耗时),即为带宽受限场景,进而后续针对不同的受限场景,可以选择不同的方式确定计算开销值。Here, the data capacity that can be transmitted by DMA in the target time is related to the transmission speed, and the target data capacity is related to the calculation speed. When the ratio is greater than 1, it indicates that the transmission speed is greater than the calculation speed (that is, the transmission time is less than the calculation time), which is Computation-limited scenarios; when the ratio is less than or equal to 1, it indicates that the transmission speed is less than or equal to the calculation speed (that is, the transmission time is greater than or equal to the calculation time), which is a bandwidth-limited scenario, and then for different restricted scenarios, You can choose different ways to determine the computational overhead value.
The target time corresponding to the parameter data of the first candidate operator on the computing device can be determined based on the preset size information of that parameter data, i.e., the target time required for the computing device to run the corresponding network layer to be processed using the parameter data corresponding to the first candidate operator. The DMA rate of the computing device can then be multiplied by the target time to obtain the data capacity that the DMA can transfer within the target time.
Meanwhile, the target data capacity of the parameter data corresponding to the first candidate operator can be determined based on the preset size information of that parameter data. For example, when the first candidate operator is a convolution operator, the target data capacity may be the sum of the constant data (including weight data and bias data), the output data, and the input data. The restricted scenario can then be determined based on the ratio of the computed DMA-transferable capacity within the target time to the target data capacity.
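As a minimal sketch of the ratio-based classification described above (the function and parameter names below are illustrative, not from the patent):

```python
def classify_scenario(target_time_s, dma_rate_bytes_per_s, target_capacity_bytes):
    """Classify the restricted scenario of an operator at a preset size.

    target_time_s: time the computing device needs to run the layer.
    dma_rate_bytes_per_s: DMA transfer rate of the computing device.
    target_capacity_bytes: total parameter data (e.g. constant + input + output).
    """
    # Data capacity the DMA can transfer within the target time.
    transferable = dma_rate_bytes_per_s * target_time_s
    ratio = transferable / target_capacity_bytes
    # ratio <= 1: transfer is the bottleneck -> bandwidth-limited;
    # ratio >  1: computation is the bottleneck -> computation-limited.
    return "bandwidth-limited" if ratio <= 1 else "computation-limited"
```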
In specific implementation, after the computing device is determined, the DMA task overhead of that device, in seconds (s), can be determined; for example, the number of cycles consumed to create each DMA task can be converted into a time, which gives the DMA task overhead. The DMA rate, i.e., the DMA transfer rate in Byte/s, can also be determined.
In step 2, the computation overhead value of the first candidate operator under the blocking strategy can be determined using a first overhead function. The first overhead function may be: computation overhead = total DMA data transfer / DMA rate + number of DMA tasks × DMA task overhead + data conversion overhead.
That is, when a bandwidth-limited scenario is determined, the total DMA data transfer (in Byte), the number of DMA tasks, and the data conversion overhead (in seconds) of the first candidate operator under the blocking strategy can be determined based on the blocking result. The total DMA data transfer can be determined from the generated DMA tasks; the number of DMA tasks can be determined from the number of data blocks obtained after the parameter data is partitioned under the blocking strategy. For example, if each data block corresponds to one DMA task and 10 data blocks are generated, it is determined that there are 10 DMA tasks. Here, the total DMA data transfer and the number of DMA tasks can be determined according to the actual situation; this is only an exemplary description. For example, when the first candidate operator is a convolution operator, the number of DMA tasks obtained after blocking can be determined from convolution parameters such as the kernel size and stride of the convolution operator.
The data conversion overhead is the time consumed to convert the data layout of the input data of the first candidate operator into the target data layout corresponding to the first candidate operator. When the data layout of the input data of the first candidate operator is consistent with its target data layout, the data conversion overhead is 0; when they are inconsistent, the data conversion overhead can be calculated according to the following formula: data conversion overhead = total data capacity of the input data × 2 / DMA rate, where the total data capacity of the input data is all of the input data fed into the network layer to be processed before blocking.
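The first overhead function and the data conversion overhead above can be sketched as follows (a minimal illustration under the stated formulas; the names are ours, not the patent's):

```python
def data_conversion_overhead(input_bytes, dma_rate, layouts_match):
    """0 when the input layout already matches the operator's target layout;
    otherwise the input is moved twice, hence the factor of 2."""
    return 0.0 if layouts_match else input_bytes * 2 / dma_rate

def bandwidth_limited_cost(total_dma_bytes, dma_rate,
                           num_dma_tasks, dma_task_overhead_s,
                           conversion_overhead_s):
    """First overhead function (bandwidth-limited scenario):
    transfer time + per-task setup time + layout conversion time."""
    return (total_dma_bytes / dma_rate
            + num_dma_tasks * dma_task_overhead_s
            + conversion_overhead_s)
```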
In step 3, when a computation-limited scenario is determined, the computation overhead value of the first candidate operator under the blocking strategy can be calculated using a second overhead function. The second overhead function is: computation overhead = operator-overhead-converted bandwidth × number of operator invocations / DMA rate + total initial data transfer / DMA rate + number of DMA tasks × DMA task overhead + data conversion overhead.
The operator-overhead-converted bandwidth is the operator data transfer amount determined based on the computation time of the first candidate operator at the preset size and the size of the parameter data of the first candidate operator under the blocking strategy. For example, if the preset size is 1024×1024×128, the computation time of the first candidate operator at the preset size is 10 milliseconds, and the size of the blocked parameter data is 512×512×64, then the computation time of the parameter data of the first candidate operator under the blocking strategy is 1.25 milliseconds. The operator-overhead-converted bandwidth of the blocked first candidate operator can then be determined based on the determined computation speed and the computation time of the parameter data of the first candidate operator under the blocking strategy (e.g., 1.25 milliseconds).
Specifically, the computation time of the parameter data of the first candidate operator under the blocking strategy, the number of operator invocations of the first candidate operator, the total initial data transfer, the number of DMA tasks, and the data conversion overhead can be determined based on the blocking result. The number of operator invocations can be determined from the number of data blocks obtained after the parameter data is partitioned under the blocking strategy; for example, if 10 data blocks are obtained, the number of operator invocations is determined to be 10. The total initial data transfer is the data capacity of the initial data determined under the blocking strategy. The target data capacity, the number of operator invocations, and the total initial data transfer can be determined according to the actual situation.
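A sketch of the second overhead function, together with the per-block computation-time scaling from the worked example above. The names are illustrative, and the scaling assumes computation time is proportional to element count, as the 1024×1024×128 → 512×512×64 example implies:

```python
from math import prod

def blocked_compute_time(preset_time_s, preset_shape, block_shape):
    """Scale the computation time measured at the preset size down to one
    block, assuming time is proportional to the element count."""
    return preset_time_s * prod(block_shape) / prod(preset_shape)

def compute_limited_cost(converted_bandwidth_bytes, num_operator_calls,
                         dma_rate, initial_transfer_bytes,
                         num_dma_tasks, dma_task_overhead_s,
                         conversion_overhead_s):
    """Second overhead function (computation-limited scenario)."""
    return (converted_bandwidth_bytes * num_operator_calls / dma_rate
            + initial_transfer_bytes / dma_rate
            + num_dma_tasks * dma_task_overhead_s
            + conversion_overhead_s)

# Worked example from the text: 10 ms at 1024x1024x128, blocks of 512x512x64
# (an 8x reduction in elements) -> 1.25 ms per block.
t_block = blocked_compute_time(0.010, (1024, 1024, 128), (512, 512, 64))
```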
In steps 2 and 3, the target data capacity corresponding to the aligned parameter data of the first candidate operator, the number of operator invocations, the total initial data transfer, the number of DMA tasks, and the data conversion overhead can all be obtained based on the blocking result.
The data conversion overhead in step 3 is determined in the same way as in step 2 and is not described in detail again here. The embodiments of the present disclosure can mainly be applied to bandwidth-limited scenarios: when the bandwidth-limited scenario is satisfied, step 2 is used to determine the computation overhead value; when it is not satisfied (i.e., in the computation-limited scenario), step 3 can be used to determine the computation overhead value.
In the above implementations, the restricted scenario corresponding to the first candidate operator at the preset size can be determined, and different restricted scenarios correspond to different methods of determining the computation overhead value. For example, in a bandwidth-limited scenario, the computation overhead value can be determined based on the total DMA data transfer, the number of DMA tasks, the data conversion overhead, the DMA rate, and the DMA task overhead; in a computation-limited scenario, it can be determined based on the computation time, the number of operator invocations, the total initial data transfer, the data conversion overhead, and the DMA rate.
In an optional implementation, before the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators are selected from the first candidate operators and the multiple blocking strategies, the method further includes:
performing, based on the determined minimum granularity information corresponding to the target neural network, an alignment operation on the parameter data corresponding to the first candidate operator to obtain aligned parameter data corresponding to the first candidate operator; the minimum granularity information includes the minimum granularity of the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information.
Here, the minimum granularity information includes the minimum granularity of the parameter data in each dimension. For example, when the parameter data includes weight data, the minimum granularity information corresponding to the weight data includes the minimum granularity in the width dimension, the minimum granularity in the length dimension, the minimum granularity in the input-channel dimension, and the minimum granularity in the output-channel dimension. The minimum granularity information can be determined according to the operating requirements of the computing device and/or user requirements; this is only an exemplary description.
The determined minimum granularity information corresponding to the target neural network can be used to perform an alignment operation on the parameter data corresponding to each first candidate operator, obtaining aligned parameter data whose size in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information. For example, if the minimum granularity of the width dimension is 32 and the width of the parameter data is 33, the width of the generated aligned parameter data is 64; if the width of the parameter data is 31, the width of the aligned parameter data is 32.
The specific alignment procedure can be selected according to actual needs. For example, a conventional data alignment method (such as padding) can be used to align the parameter data and generate the aligned parameter data.
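The rounding-up rule in the alignment example above (33 → 64 and 31 → 32 with a granularity of 32) can be sketched as follows (illustrative names, not the patent's):

```python
def align_up(size, granularity):
    """Round a dimension size up to the nearest integer multiple of the
    minimum granularity of that dimension (ceiling division)."""
    return -(-size // granularity) * granularity

def align_shape(shape, min_granularity):
    """Align every dimension of the parameter data independently."""
    return tuple(align_up(s, g) for s, g in zip(shape, min_granularity))
```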
In another implementation, the computing device can also fetch the unaligned parameter data from DDR, compute with garbage padding data, then select the valid data from the data output by the computing device and write the valid data back to DDR as the output data.
Here, based on the minimum granularity information corresponding to the target neural network, an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data, whose size in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information. This reduces the probability of parameter data loss when the target neural network is subsequently run under the target blocking strategy.
For S202: In an optional implementation, in S202, determining the target operator and target blocking strategy corresponding to each network layer to be processed, based on the target candidate operators and target candidate blocking strategies corresponding to the respective network layers to be processed, includes:
S2021: determining multiple test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate blocking strategies corresponding to the target candidate operators; each test network includes one target candidate operator for each network layer to be processed and one target candidate blocking strategy matching that target candidate operator.
S2022: running the multiple test networks separately to obtain multiple test results, where each test network corresponds to one test result.
S2023: selecting a target test network from the multiple test networks based on the multiple test results.
S2024: determining the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
In S2021, exemplarily, suppose the target neural network includes a first, a second, and a third network layer to be processed. The first network layer to be processed includes target candidate operator 1 with its corresponding blocking strategy 1, and target candidate operator 2 with its corresponding blocking strategy 2; the second network layer to be processed includes target candidate operator 3 with its corresponding blocking strategy 1, and target candidate operator 4 with its corresponding blocking strategy 1; the third network layer to be processed includes target candidate operator 5 with its corresponding blocking strategy 3.
Four test networks corresponding to the target neural network can then be obtained. The first test network includes: target candidate operator 1 with blocking strategy 1, target candidate operator 3 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The second test network includes: target candidate operator 1 with blocking strategy 1, target candidate operator 4 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The third test network includes: target candidate operator 2 with blocking strategy 2, target candidate operator 3 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The fourth test network includes: target candidate operator 2 with blocking strategy 2, target candidate operator 4 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3.
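The enumeration above is a Cartesian product over the per-layer (operator, blocking strategy) candidates. A sketch reproducing the four-network example (the layer, operator, and strategy names are placeholders):

```python
from itertools import product

# One list of (operator, blocking strategy) pairs per network layer to be processed.
layer_candidates = [
    [("op1", "strategy1"), ("op2", "strategy2")],  # first layer
    [("op3", "strategy1"), ("op4", "strategy1")],  # second layer
    [("op5", "strategy3")],                        # third layer
]

# Each test network fixes one (operator, strategy) pair per layer.
test_networks = list(product(*layer_candidates))
```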
In S2022 and S2023, the computing device can be controlled to run the multiple test networks separately and determine the test result of each test network. For example, the test result can be the running time of each test network. The target test network can then be selected from the multiple test networks based on their test results; for example, the test network with the shortest running time can be selected as the target test network.
In S2024, the target candidate operators and target candidate blocking strategies of the network layers to be processed included in the target test network can be determined as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
For example, if the second test network is determined to be the target test network, then target candidate operator 1 is determined to be the target operator of the first network layer to be processed, and blocking strategy 1 is its target blocking strategy; target candidate operator 4 is the target operator of the second network layer to be processed, and blocking strategy 1 is its target blocking strategy; target candidate operator 5 is the target operator of the third network layer to be processed, and blocking strategy 3 is its target blocking strategy.
To reduce the cost and computing resources consumed by running the test networks and to improve the efficiency of determining the target operators and target blocking strategies, in specific implementation a maximum number of target operators matched with target blocking strategies can be set for each network layer to be processed. For example, when the maximum number is set to 2, each network layer to be processed may include one target operator matched with a target blocking strategy, e.g., target operator 1 matched with target blocking strategy 1. Alternatively, each network layer to be processed may include two target operators matched with target blocking strategies; for example, the two may be: target operator 1 matched with target blocking strategy 1 and target operator 1 matched with target blocking strategy 2; or target operator 1 matched with target blocking strategy 1 and target operator 2 matched with target blocking strategy 2; or target operator 1 matched with target blocking strategy 1 and target operator 2 matched with target blocking strategy 1; and so on.
And/or, in specific implementation, a threshold on the number of test networks corresponding to the target neural network can be set. For example, suppose the threshold is 100 and there are 10 network layers to be processed. If the first through sixth network layers to be processed each have 2 target operators matched with target blocking strategies, then across the first through sixth layers, the number of partial test networks formed from the target operators and target blocking strategies of each layer can be 2^6 = 64. Further, when determining the target operator and target blocking strategy of the seventh network layer to be processed, if the seventh layer also has 2 target operators matched with target blocking strategies, the number of partial test networks across the first through seventh layers would be 2^7 = 128, which exceeds the set threshold; in this case, the seventh, eighth, ninth, and tenth network layers to be processed may each have only 1 target operator matched with a target blocking strategy.
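The candidate cap and the network-count threshold can be combined in a simple greedy pass. The sketch below reproduces the 2^6 = 64 vs. 2^7 = 128 example; it is our illustration of the budgeting idea, not the patent's algorithm verbatim:

```python
def cap_candidates(per_layer_counts, max_networks):
    """Keep the running product of test networks within the threshold by
    dropping later layers to a single candidate once the budget is hit."""
    capped, total = [], 1
    for count in per_layer_counts:
        if total * count > max_networks:
            count = 1  # this layer keeps only one matched target operator
        capped.append(count)
        total *= count
    return capped, total

# 10 layers with 2 candidates each, threshold 100: layers 1-6 give 2**6 = 64,
# a 7th doubling would give 128 > 100, so layers 7-10 fall back to 1 candidate.
counts, total = cap_candidates([2] * 10, 100)
```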
In the above implementation, multiple test networks corresponding to the target neural network are determined based on the at least one target candidate operator corresponding to each network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators; the computing device then runs the multiple test networks, and the test result of each test network is determined. The target test network is determined based on the test results; for example, when the test result is the computation overhead, the test network with the smallest computation overhead can be selected as the target test network, and the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network are determined as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network. This achieves a global optimization of the target operators and target blocking strategies.
In an optional implementation, when the specified dimension is one-dimensional, the dimension parameter is the first dimension; when the specified dimension is N-dimensional, the dimension parameters include the first dimension through the Nth dimension, where N is greater than 2 and less than the dimensionality of the constant data or of the input data. In the case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following:
Scheme 1: take all of the input data as initial data, and perform one-dimensional blocking on the constant data based on the determined first dimension of the constant data to obtain a blocking result; the initial data is the data written into the initial data area allocated to the direct memory access (DMA) task when the computing device runs the target neural network.
Scheme 2: take all of the input data as initial data, and perform two-dimensional blocking on the constant data based on the determined first dimension and second dimension of the constant data to obtain a blocking result.
Scheme 3: take all of the constant data as initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a blocking result.
Scheme 4: take all of the constant data as initial data, and perform two-dimensional blocking on the input data based on the determined first dimension and second dimension of the input data to obtain a blocking result.
Scheme 5: take part of the input data as initial data, and perform one-dimensional blocking on the constant data based on the determined first dimension of the constant data to obtain a blocking result; the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
Scheme 6: take part of the input data as initial data, and perform two-dimensional blocking on the constant data based on the determined first dimension and second dimension of the constant data to obtain a blocking result; the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
Scheme 7: take part of the constant data as initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a blocking result; the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
Scheme 8: take part of the constant data as initial data, and perform two-dimensional blocking on the input data based on the determined first dimension and second dimension of the input data to obtain a blocking result; the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
Here, all of the input data can be used as initial data, with the space for the initial data allocated in the initial data area; one-dimensional blocking is then performed on the constant data based on its determined first dimension, or two-dimensional blocking is performed on the constant data based on its determined first and second dimensions, to obtain a blocking result.
All of the constant data can be used as initial data, and one-dimensional blocking performed on the input data based on its determined first dimension, or two-dimensional blocking performed on the input data based on its determined first and second dimensions, to obtain a blocking result.
Part of the input data can also be used as initial data, and one-dimensional blocking performed on the input data based on its determined first dimension, or two-dimensional blocking performed on the input data based on its determined first and second dimensions, to obtain a blocking result.
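A minimal sketch of the one- and two-dimensional blocking used by schemes 1 through 8, splitting a tensor along its first one or two dimensions (function names and block sizes are illustrative):

```python
def split_1d(size, block):
    """One-dimensional blocking: the block sizes along a single dimension,
    with a possibly smaller tail block."""
    return [min(block, size - start) for start in range(0, size, block)]

def split_2d(shape, blocks):
    """Two-dimensional blocking over the first and second dimensions."""
    return [(h, w)
            for h in split_1d(shape[0], blocks[0])
            for w in split_1d(shape[1], blocks[1])]
```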
In an optional implementation, in scheme 5 or scheme 6, taking part of the input data as initial data and blocking the constant data in the specified dimension(s) based on the determined dimension parameters of the constant data to obtain a blocking result includes:
1. Determining the target size of the partial input data based on i times the minimum granularity of the first dimension of the input data.
2. Taking the partial input data of the target size as initial data, and blocking the constant data in the specified dimension(s) based on the determined dimension parameters of the constant data, to obtain a blocking result.
Here, i is a positive integer such that, once the target size of the partial input data is determined, the data capacity of the partial input data, together with the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameters of the constant data, satisfies the memory requirements of the computing device.
Here, the maximum value of i can be determined incrementally. The following takes scheme five (i.e., one-dimensional blocking) as an example. i is incremented starting from 1: when i=1, the target size of the part of the input data is 1 times the minimum granularity of the first dimension of the input data; the part of the input data of that target size is used as the initial data, and the constant data is blocked in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result.
When the one-dimensional blocking result corresponding to i=1 indicates that the allocate of the constant data failed, scheme five is unavailable. When the one-dimensional blocking result corresponding to i=1 indicates that the allocate of the constant data succeeded, i is incremented by 1 (giving i=2), and the process returns to the step of determining the target size of the part of the input data; that is, the target size becomes twice the minimum granularity of the first dimension of the input data, the part of the input data of that target size is used as the initial data, and the constant data is blocked in one dimension based on the determined first dimension of the constant data to obtain a one-dimensional blocking result. When the one-dimensional blocking result corresponding to i=2 indicates that the allocate of the constant data failed, the maximum value of i is determined to be 1 and the incrementing ends; when the one-dimensional blocking result indicates that the allocate succeeded, i is incremented by 1 (so that i=3), and the process again returns to the step of determining the target size of the part of the input data, until a one-dimensional blocking result indicates that the allocate of the constant data failed. For example, if the one-dimensional blocking result at i=6 indicates that the allocate of the constant data failed, the maximum value of i is determined to be 5. When the maximum value of i is 5, this scheme can yield 5 blocking results.
A blocking result indicating that the allocate of the constant data failed may mean that, after the constant data is divided according to the minimum granularity of the first dimension, the resulting constant data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, the allocate of the constant data fails when twice the data capacity of a constant data block divided according to the minimum granularity of the first dimension exceeds the memory of the scheduling area of the computing device.
For example, when the maximum value of i is 5, scheme five may include the following 5 blocking strategies:
Mode 1: determining 1 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result;
Mode 2: determining 2 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result;
...
Mode 5: determining 5 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result.
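The incremental determination of the maximum value of i described above can be sketched as follows. This is an illustrative sketch only: `allocate_ok`, `make_check`, and all capacity figures are hypothetical stand-ins for the allocate attempt on the computing device, not the disclosed implementation.

```python
def max_i(min_granularity, allocate_ok):
    # Increment i from 1; at each step the target size of the partial
    # input data is i times the minimum granularity of its first
    # dimension. Stop at the first failed allocate attempt and return
    # the last successful i (0 means even i=1 failed, i.e. the scheme
    # is unavailable).
    i = 1
    while allocate_ok(i * min_granularity):
        i += 1
    return i - 1

def make_check(device_mem, const_block_cap, row_bytes):
    # Hypothetical allocate attempt: the initial data (the partial
    # input, target_size rows of row_bytes each) plus a ping-pong pair
    # of constant data blocks must fit in device memory.
    def allocate_ok(target_size):
        return target_size * row_bytes + 2 * const_block_cap <= device_mem
    return allocate_ok

check = make_check(device_mem=4096, const_block_cap=128, row_bytes=32)
print(max_i(32, check))  # → 3: i=4 would need 4*32*32 + 256 > 4096
```

With a maximum of 3, scheme five would contribute three candidate blocking strategies (target sizes of 1, 2, and 3 times the minimum granularity).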
Part of the constant data may also be used as the initial data, and the input data blocked in one dimension based on the determined first dimension of the input data.
In an optional implementation, in scheme seven or scheme eight, using part of the constant data as the initial data and blocking the input data in the specified dimension based on the determined dimension parameter of the input data to obtain a blocking result includes:
1. Determining a target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data;
2. Using the part of the constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameter of the input data, to obtain a blocking result.
Here, the maximum value of j can be determined incrementally. The following takes scheme seven as an example. j is incremented starting from 1: when j=1, the target size of the part of the constant data is 1 times the minimum granularity of the first dimension of the constant data; the part of the constant data of that target size is used as the initial data, and the input data is blocked in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result.
When the one-dimensional blocking result corresponding to j=1 indicates that the allocate of the input data failed, scheme seven is unavailable. When the one-dimensional blocking result corresponding to j=1 indicates that the allocate of the input data succeeded, j is incremented by 1 (giving j=2), and the process returns to the step of determining the target size of the part of the constant data, until a one-dimensional blocking result indicates that the allocate of the input data failed. For example, if the one-dimensional blocking result at j=6 indicates that the allocate of the input data failed, the maximum value of j is determined to be 5. When the maximum value of j is 5, this scheme can yield 5 blocking results.
A blocking result indicating that the allocate of the input data failed may mean that, after the input data is divided according to the minimum granularity of the first dimension, the resulting input data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, the allocate of the input data fails when twice the data capacity of an input data block divided according to the minimum granularity of the first dimension exceeds the memory of the scheduling area of the computing device. For example, if the initial data, the scheduling-data ping (an input data block divided according to the minimum granularity of the first dimension), and the scheduling-data pong (another such input data block) do not meet the memory requirements of the computing device, it is determined that the allocate of the input data failed.
For example, when the maximum value of j is 6, scheme seven may include the following 6 blocking strategies:
Mode 1: determining 1 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result;
Mode 2: determining 2 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result;
...
Mode 6: determining 6 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result.
Here, the first dimension and second dimension along which the input data is blocked can be set according to information such as operating requirements and/or the operator type, and likewise for the first dimension and second dimension along which the constant data is blocked. For example, if the operator is a convolution operator, the first dimension of the constant data may be the output channel (OC) dimension, and the second dimension may be the input channel (IC) dimension.
Here, providing multiple blocking strategies enables each network layer to be processed to select a better target operator and a target blocking strategy matching that target operator.
In an optional implementation, when the specified dimension is one dimension and the dimension parameter includes the first dimension, using the constant data and the input data respectively as target data and blocking the target data in one dimension based on the determined first dimension of the target data to obtain a one-dimensional blocking result includes:
A1: determining k times the minimum granularity corresponding to the first dimension of the target data as the target block size, and, based on the target block size, blocking the target data in one dimension along the first dimension to obtain multiple target data blocks corresponding to the target data, where k is a positive integer;
A2: when it is determined that the multiple target data blocks and the initial data meet the set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as the updated target block size and returning to the step of blocking the target data in one dimension along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data (the last value of k that satisfied the condition) as the blocking result;
A3: when the initial data and the multiple target data blocks generated when k equals 1 do not meet the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
With the above method, the target block size is continuously increased, and the blocking result that gives the computing device a higher memory utilization is determined by repeated attempts, which helps reduce the waste of the computing device's memory resources.
In step A1, k is a positive integer. Starting from k=1, the minimum granularity corresponding to the first dimension of the target data is determined as the target block size, and the target data is blocked in one dimension along the first dimension according to the target block size, to obtain multiple target data blocks corresponding to the target data. The first-dimension size of each resulting target data block equals the target block size, and the size of each target data block in every dimension other than the first equals the size of the corresponding dimension of the target data.
For example, if the minimum granularity of the first dimension is 32 and the size of the target data is 64×64×128, the target block size is 32; blocking the target data in one dimension along the first dimension according to the target block size yields multiple target data blocks, each of size 32×64×128. The number of target data blocks can be determined according to the actual situation.
The first dimension can be set as needed. For example, the first dimension of the input data may be the width (W) dimension and its second dimension the input channel (IC) dimension; the first dimension of the constant data may be the output channel (OC) dimension and its second dimension the input channel (IC) dimension.
It can then be judged whether the multiple target data blocks and the initial data meet the set blocking condition. If so, twice the minimum granularity corresponding to the first dimension of the target data is taken as the updated target block size, and the process returns to the step of blocking the target data in one dimension along the first dimension according to the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and k times the minimum granularity corresponding to the first dimension of the target data is determined as the blocking result. For example, if at k=5 it is determined that the multiple target data blocks generated and the initial data do not meet the set blocking condition, 4 times the minimum granularity corresponding to the first dimension of the target data is determined as the blocking result. That is, when running the network layer to be processed, 4 times the minimum granularity of the first dimension can be used as the target block size, and the target data corresponding to the target operator of that layer is blocked in one dimension according to the target block size.
If the condition is not met (i.e., the multiple target data blocks generated when k=1 and the initial data do not meet the set blocking condition), the blocking result is determined to be a one-dimensional blocking failure.
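Steps A1 to A3 above can be sketched as the following search loop. This is illustrative only: `fits` is a hypothetical stand-in for the set blocking condition, and the guard that stops the block size at the dimension size is an added assumption.

```python
def one_dim_blocking(first_dim_size, min_granularity, fits):
    # A1: start with a target block size of 1x the minimum granularity.
    # A2: while the blocks satisfy the blocking condition, grow the
    #     block size by one more multiple of the granularity.
    # A3: return None if even k=1 fails (one-dimensional blocking
    #     fails); otherwise return the last size that satisfied it.
    k, best = 1, None
    while True:
        block = k * min_granularity
        if block > first_dim_size or not fits(block):
            return best
        best = block
        k += 1

def fits(block):
    # Hypothetical condition: one (block x 64 x 128) target data block
    # plus fixed initial data must fit in a 700 KB budget.
    return block * 64 * 128 + 100_000 <= 700_000

print(one_dim_blocking(64, 32, fits))  # → 64
```

Here the search stops at 64 because 96 would exceed the 64-wide first dimension; with a tighter budget it would instead stop at the first failed `fits` check, mirroring the allocate failure described above.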
In an optional implementation, when the specified dimension is two dimensions and the dimension parameter includes the second dimension, using the constant data and the input data respectively as target data and blocking the target data in two dimensions based on the determined first dimension and second dimension of the target data to obtain a blocking result includes:
B1: determining y times the minimum granularity corresponding to the first dimension of the target data as the first target block size, and, based on the first target block size, blocking the target data in one dimension along the first dimension to obtain multiple intermediate data blocks corresponding to the target data, where y is a positive integer;
B2: determining x times the minimum granularity corresponding to the second dimension of the target data as the second target block size, and, based on the second target block size, blocking each intermediate data block along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer;
B3: when it is determined that the multiple target data blocks and the initial data meet the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as the updated second target block size and returning to the step of blocking each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
In B1, y is a positive integer with an initial value of 1. For example, when the maximum value of y is set to 3, y can be set to 1 and steps B1 to B3 executed to obtain one two-dimensional blocking result; then y set to 2 and steps B1 to B3 executed to obtain another; and y set to 3 and steps B1 to B3 executed to obtain a third, so that 3 two-dimensional blocking results can be obtained.
Taking y=1 as an example of the two-dimensional blocking process: if the minimum granularity corresponding to the first dimension is 32 and the size of the target data is 128×128×256, the target data can be blocked in one dimension along the first dimension based on the first target block size, yielding multiple intermediate data blocks corresponding to the target data, each of size 32×128×256. The number of intermediate data blocks can be determined according to the actual situation.
In B2, continuing the example in B1: x is a positive integer. Starting from x=1, 1 times the minimum granularity corresponding to the second dimension of the target data is determined as the second target block size. For example, if the minimum granularity of the second dimension is 32, the second target block size is 32; based on the second target block size, each intermediate data block is blocked along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, each target data block being of size 32×32×256.
In B3, it can be judged whether the multiple target data blocks and the initial data meet the set blocking condition. If so, 2 (i.e., x+1) times the minimum granularity corresponding to the second dimension of the target data is taken as the updated second target block size, and the process returns to the step of blocking each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and x times the minimum granularity corresponding to the second dimension of the target data is determined as the blocking result.
For example, if at x=3 it is determined that the multiple target data blocks generated and the initial data do not meet the set blocking condition, 2 times the minimum granularity corresponding to the second dimension of the target data is determined as the blocking result. That is, when running the network layer to be processed, the minimum granularity of the first dimension can be used as the first target block size and twice the minimum granularity of the second dimension as the second target block size, and the target data corresponding to the target operator of that layer is blocked in two dimensions based on the first target block size and the second target block size.
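For a fixed y, steps B1 to B3 can be sketched as follows (illustrative only; `fits` is a hypothetical blocking condition, not the disclosed one):

```python
def two_dim_blocking(min_gran_first, min_gran_second, y, fits):
    # B1: the first target block size is y times the first dimension's
    #     minimum granularity.
    # B2/B3: grow x from 1; the second target block size is x times
    #     the second dimension's minimum granularity. Return the last
    #     pair that satisfied the blocking condition (the second size
    #     is None if even x=1 fails).
    first = y * min_gran_first
    x, best = 1, None
    while fits(first, x * min_gran_second):
        best = x * min_gran_second
        x += 1
    return first, best

def fits(first, second):
    # Hypothetical condition: one (first x second x 256) target data
    # block plus initial data must fit in 1 MiB.
    return first * second * 256 + 200_000 <= 1_048_576

print(two_dim_blocking(32, 32, 1, fits))  # → (32, 96)
```

Running this once for each y from 1 up to its maximum would yield the several two-dimensional blocking results of the example above.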
一种可选实施方式中,在待处理网络层对应的参数数据还包括输出数据的情况下,确定多个目标数据块和初始数据满足设置的分块条件,包括:在确定初始数据、输出数据、和每个目标数据块分别满足计算设备的内存要求,以及初始数据、输出数据、和每个目标数据块分别满足计算设备中DMA传输要求的情况下,确定多个目标数据块和初始数据满足设置的分块条件。In an optional embodiment, when the parameter data corresponding to the network layer to be processed also includes output data, determining that multiple target data blocks and initial data meet the set block conditions, including: determining initial data, output data , and each target data block respectively meet the memory requirements of the computing device, and when the initial data, output data, and each target data block respectively meet the DMA transfer requirements in the computing device, determine that multiple target data blocks and initial data satisfy The set block condition.
这里,计算设备的内存要求可以根据用户需求和/或计算设备需求进行设置。比如,可以确定初始数据、输出数据、和每个目标数据块的数据容量总和,是否小于或等于设置的计算设备的内存容量,若是,则确定满足计算设备的内存要求。Here, the memory requirements of the computing device may be set according to user requirements and/or computing device requirements. For example, it can be determined whether the total data capacity of the initial data, output data, and each target data block is less than or equal to the set memory capacity of the computing device, and if so, it is determined to meet the memory requirements of the computing device.
或者,还可以确定初始数据的数据容量是否小于或等于在计算设备的内存上为初始数据分配的第一局部内存容量,确定输出数据的数据容量是否小于或等于在计算设备的内存上为输出数据分配的第二局部内存容量,以及确定每个目标数据块的数据容量是否小于或等于在计算设备的内存上为目标数据分配的三局部内存容量,若初始数据、输出数据和每个目标数据块均满足要求,则确定满足计算设备的内存要求。Alternatively, it is also possible to determine whether the data capacity of the initial data is less than or equal to the first local memory capacity allocated for the initial data on the memory of the computing device, and determine whether the data capacity of the output data is less than or equal to the output data on the memory of the computing device. The allocated second local memory size, and determining whether the data size of each target data block is less than or equal to the three local memory sizes allocated for the target data on the memory of the computing device, if the initial data, output data and each target data block If all requirements are met, it is determined that the memory requirements of the computing device are met.
在具体实施时,还可以设置专用内存和公共内存,若设置常数数据存储在公共内存中,输入数据和输出数据存储在专用内存上,则可以判断初始数据、输出数据、和每个目标数据块是否均满足对应的专用内存和公共内存的内存要求,若是,则确定满足计算设备的内存要求。即在初始数据为输入数据,目标数据块为常数数据对应的目标数据块,则判断初始数据和输出数据的数据容量是否小于或等于设置的专用内存的内存容量,以及判断每个目标数据块是否小于或等于设置的公共内存的内存容量,若均满足,则确定满足计算设备的内存要求。In specific implementation, special memory and public memory can also be set. If the constant data is set to be stored in the public memory, and the input data and output data are stored in the special memory, the initial data, output data, and each target data block can be determined. Whether both of them meet the memory requirements of the corresponding dedicated memory and public memory, and if so, determine that the memory requirements of the computing device are met. That is, when the initial data is the input data and the target data block is the target data block corresponding to the constant data, then judge whether the data capacity of the initial data and output data is less than or equal to the set memory capacity of the dedicated memory, and judge whether each target data block is It is less than or equal to the set memory capacity of the public memory. If all are satisfied, it is determined that the memory requirements of the computing device are satisfied.
Exemplarily, after each target data block has been determined, an allocate attempt can be made for the target data blocks, the initial data, and the output data; if the attempt succeeds, the initial data, the output data, and each target data block are determined to meet the memory requirements of the computing device.
The DMA transfer requirements can be determined according to actual needs. For example, if the sum of the data capacities of the initial data, the output data, and each target data block is less than or equal to the data capacity transferable by DMA, i.e., when it is determined that the DMA task is successfully established, the DMA transfer requirements of the computing device are determined to be met.
When it is determined that the initial data, the output data, and each target data block meet the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data meet the set blocking condition.
With the above method, the multiple target data blocks and the initial data are determined to meet the set blocking condition only when the initial data, the output data, and each target data block meet both the memory requirements and the DMA transfer requirements of the computing device, ensuring that the blocking strategy matches the operating requirements of the computing device.
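The combined check can be sketched as follows. This sketch uses the simple sum-based variants of the memory and DMA limits described above; the per-region and dedicated/shared-memory variants are alternatives, and the capacities shown are hypothetical.

```python
def meets_blocking_condition(initial_cap, output_cap, block_caps,
                             device_mem, dma_max):
    # The initial data, the output data, and every target data block
    # must together fit in the computing device's memory, and the same
    # total must not exceed the capacity the DMA can transfer.
    total = initial_cap + output_cap + sum(block_caps)
    return total <= device_mem and total <= dma_max

print(meets_blocking_condition(100, 50, [200, 200], 1000, 800))  # True
print(meets_blocking_condition(100, 50, [400, 400], 1000, 800))  # False (DMA limit)
```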
Regarding S103: after the target operator and target blocking strategy corresponding to each network layer to be processed of the target neural network have been determined, the target neural network containing the target operators can be run based on the target blocking strategies respectively corresponding to the at least one network layer to be processed.
For example, an image to be processed can be input into the target neural network; the computing device performs feature extraction on the image using the target blocking strategy and target operator respectively corresponding to each network layer to be processed, and determines a detection result corresponding to the image. The detection result may be, for example, the category of a target object included in the image, the position information of the target object, the contour information of the target object, and the like.
Exemplarily, FIG. 4 is a schematic diagram of software and hardware scheduling of the computing device in a neural network operation method. The process of using ping-pong scheduling to process the parameter data of a network layer to be processed is described with reference to FIG. 4. The memory of the computing device may be divided into an initial data area, a scheduling data area ping, a scheduling data area pong, an output data area ping, and an output data area pong. When the initial data is the input data, the scheduling data is the constant data; when the initial data is the constant data, the scheduling data is the input data.
As can be seen from FIG. 4, the computing device and the DMA run in parallel. The DMA first transmits the initial data and the scheduling ping (that is, the scheduling data ping) to the corresponding memory areas of the computing device (that is, the initial data is transmitted to the memory area corresponding to the initial data area of the computing device, and the scheduling data ping is transmitted to the memory area corresponding to the scheduling data area ping of the computing device). The computing device then processes the initial data and the scheduling ping; at the same time, the DMA may transmit the scheduling pong (that is, the scheduling data pong) to the memory area corresponding to the scheduling data pong of the computing device.
After the computing device finishes processing the initial data and the scheduling ping, an output ping (that is, the output data ping) is generated and placed in the memory area corresponding to the output data area ping of the computing device; the DMA fetches the output ping from that memory area and transmits it to the corresponding external memory (for example, a DDR). The computing device then processes the received scheduling pong, while the DMA transmits the next scheduling ping to the memory area corresponding to the scheduling ping of the computing device. The above process is repeated until the parameter data of the layer to be processed has been completely processed.
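The ping-pong schedule above can be sketched as a double-buffering loop: while the device computes on one scheduling buffer, the DMA fills the other. The following minimal Python sketch is illustrative only — the function names and the sequential stand-ins for the DMA and device operations are assumptions, not part of the disclosure (a real schedule runs the loads concurrently with the compute step):

```python
def run_layer(initial_data, schedule_blocks, dma_load, compute, dma_store):
    """Ping-pong (double-buffer) scheduling sketch.

    schedule_blocks: parameter-data blocks streamed in one by one.
    dma_load/compute/dma_store stand in for DMA transfers and device work.
    """
    buffers = [None, None]                       # buffer 0 = ping, 1 = pong
    outputs = []
    buffers[0] = dma_load(schedule_blocks[0])    # prefetch the first block
    for i in range(len(schedule_blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(schedule_blocks):
            # overlapped in hardware: fill the other buffer while computing
            buffers[nxt] = dma_load(schedule_blocks[i + 1])
        out = compute(initial_data, buffers[cur])  # device processes one block
        outputs.append(dma_store(out))             # DMA writes the output back
    return outputs
```

With identity transfers and an additive compute step, `run_layer(10, [1, 2, 3], ...)` yields one output per scheduled block, the next block having been staged while the previous one was being processed.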
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same concept, an embodiment of the present disclosure further provides a neural network operation apparatus. FIG. 5 is a schematic architectural diagram of the neural network operation apparatus provided by the embodiment of the present disclosure, which includes a first determining module 501, a second determining module 502, and an operation module 503. Specifically:
The first determining module 501 is configured to determine a network layer to be processed in a target neural network.
The second determining module 502 is configured to determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network.
The operation module 503 is configured to run the target neural network including the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In a possible implementation, the blocking strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed;
among the multiple blocking strategies, the target blocking strategy is the strategy for which running the network layer to be processed, based on the parameter data obtained by blocking the parameter data of the target operator with that strategy, consumes the fewest resources.
In a possible implementation, in a case where there are multiple network layers to be processed, when determining, from the multiple determined operators and multiple blocking strategies, the target operators and target blocking strategies corresponding to the network layers to be processed in the target neural network, the second determining module 502 is configured to:
for each network layer to be processed in the target neural network, determine a target candidate operator corresponding to the network layer to be processed from the multiple operators, and determine, from the multiple blocking strategies, a target candidate blocking strategy matching the target candidate operator;
in a case where any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, determine the target operator and the target blocking strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate blocking strategies respectively corresponding to the network layers to be processed.
In a possible implementation, when determining the target operator and the target blocking strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate blocking strategies respectively corresponding to the network layers to be processed, the second determining module 502 is configured to:
determine multiple test networks corresponding to the target neural network based on the target candidate operators respectively corresponding to the network layers to be processed and the target candidate blocking strategies corresponding to those operators; wherein each test network includes, for each network layer to be processed, one target candidate operator and one target candidate blocking strategy matching that operator;
run the multiple test networks respectively to obtain multiple test results, wherein each test network corresponds to one test result;
select a target test network from the multiple test networks based on the multiple test results;
determine the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
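The selection described in this implementation amounts to a search over per-layer (operator, blocking-strategy) combinations: build every test network, run each one, and keep the best. A minimal sketch under that reading (the cost-returning `run_test` callback and the tuple-based data layout are illustrative assumptions, not part of the disclosure):

```python
from itertools import product

def select_targets(layer_candidates, run_test):
    """layer_candidates: for each layer, a list of (operator, strategy) pairs.
    run_test: runs one full test network and returns its measured cost.
    Returns the per-layer choices of the best-scoring test network."""
    test_networks = list(product(*layer_candidates))  # one combination each
    results = [(run_test(net), net) for net in test_networks]
    best_cost, best_net = min(results, key=lambda r: r[0])
    return list(best_net)
```

Because the combinations multiply across layers, a practical implementation would first prune each layer's candidates (as the cost-model steps below do) before running this search.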
In a possible implementation, when, for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators and determining, from the multiple blocking strategies, a target candidate blocking strategy matching the target candidate operator, the second determining module 502 is configured to:
for the network layer to be processed, determine one or more first candidate operators from the multiple operators;
based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, select, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
In a possible implementation, the resource consumption is represented by a computational cost value, and the second determining module 502 is configured to determine the computational cost value of a first candidate operator under each blocking strategy according to the following steps:
determining a restricted scenario corresponding to the first candidate operator under a preset size, wherein the restricted scenario is determined based on the computation time and the transfer time of the data volume corresponding to the first candidate operator under the preset size;
in a case where the restricted scenario is a bandwidth-limited scenario, determining, based on the blocking result of the blocking strategy, the total direct memory access (DMA) data transfer amount, the number of DMA tasks, and the data conversion overhead corresponding to the first candidate operator under the blocking strategy; and determining the computational cost value of the first candidate operator under the blocking strategy based on the total DMA data transfer amount, the number of DMA tasks, the data conversion overhead, and the DMA rate and DMA task overhead corresponding to the computing device; wherein the data conversion overhead is the time consumed in converting the data layout of the input data corresponding to the first candidate operator into the target data layout corresponding to the first candidate operator;
in a case where the restricted scenario is a compute-limited scenario, determining, based on the blocking result of the blocking strategy, the computation time of the parameter data corresponding to the first candidate operator under the blocking strategy, the number of operator calls of the first candidate operator, the total initial data transfer amount, the number of DMA tasks, and the data conversion overhead; and determining the computational cost value of the first candidate operator under the blocking strategy based on the computation time, the number of operator calls, the total initial data transfer amount, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
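The text names the quantities that enter the cost value in each restricted scenario but does not spell out the exact arithmetic; the sketch below is therefore a hedged reconstruction — one plausible way of combining those quantities, not the formula of the disclosure:

```python
def bandwidth_limited_cost(dma_total_bytes, dma_task_count,
                           convert_overhead, dma_rate, dma_task_overhead):
    # Transfers dominate: total bytes over the DMA rate, plus per-task
    # setup overhead and the data-layout conversion time.
    return (dma_total_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)

def compute_limited_cost(compute_time, call_count, initial_total_bytes,
                         dma_task_count, convert_overhead,
                         dma_rate, dma_task_overhead):
    # Computation dominates: per-call compute time times the number of
    # operator calls, plus the initial transfer and conversion costs
    # that cannot be hidden behind computation.
    return (compute_time * call_count
            + initial_total_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)
```

Either way, the cost value lets candidate (operator, blocking-strategy) pairs be ranked without running every one of them on the device.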
In a possible implementation, when selecting, based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate blocking strategies corresponding to the target candidate operators from the first candidate operators and the multiple blocking strategies, the second determining module 502 is configured to:
select, from the multiple resource consumption values corresponding to the first candidate operators, a target resource consumption that satisfies a preset condition; wherein one first candidate operator corresponds to one resource consumption value under one blocking strategy;
determine the blocking strategy corresponding to the target resource consumption as a candidate blocking strategy, and, based on the candidate blocking strategy, run the network layer to be processed including the second candidate operator corresponding to the target resource consumption, to determine a test result corresponding to the candidate blocking strategy and the second candidate operator;
based on the test result, determine the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
In a possible implementation, before selecting, from the first candidate operators and the multiple blocking strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators, the apparatus further includes:
an alignment module 504, configured to perform an alignment operation on the parameter data corresponding to the first candidate operators based on the determined minimum granularity information corresponding to the target neural network, to obtain aligned parameter data corresponding to the first candidate operators;
wherein the minimum granularity information includes the minimum granularity corresponding to the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information.
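The alignment operation amounts to rounding each dimension of the parameter data up to the nearest integer multiple of that dimension's minimum granularity. A minimal shape-only sketch (how the padded elements are filled is left open here):

```python
def align_shape(shape, min_granularity):
    """Round each dimension size up to a multiple of its minimum granularity."""
    aligned = []
    for size, grain in zip(shape, min_granularity):
        aligned.append(((size + grain - 1) // grain) * grain)  # ceil to multiple
    return aligned
```

For example, a (5, 17) parameter tensor with per-dimension granularities (4, 8) is padded to (8, 24) before any blocking strategy is evaluated.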
In a possible implementation, in a case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following:
taking all the input data as initial data, and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data, to obtain a blocking result; wherein the initial data is the data written into the initial data area allocated to a direct memory access (DMA) task when the computing device runs the target neural network;
taking all the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameters of the input data, to obtain a blocking result;
taking part of the input data as the initial data, and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data, to obtain a blocking result; wherein the target size of the part of the input data is determined according to the minimum granularity of the first dimension of the input data;
taking part of the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameters of the input data, to obtain a blocking result; wherein the target size of the part of the constant data is determined according to the minimum granularity of the first dimension of the constant data.
In a possible implementation, taking part of the input data as the initial data and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data to obtain a blocking result includes:
determining the target size of the part of the input data based on i times the minimum granularity of the first dimension of the input data;
taking the part of the input data of the target size as the initial data, and blocking the constant data in the specified dimension based on the determined dimension parameters of the constant data, to obtain the blocking result;
wherein i is a positive integer such that, after the target size of the part of the input data is determined, the data volume of the part of the input data, together with the data volume of the constant data blocks determined based on the minimum granularity of the dimension parameters of the constant data, satisfies the memory requirement of the computing device.
In a possible implementation, taking part of the constant data as the initial data and blocking the input data in a specified dimension based on the determined dimension parameters of the input data to obtain a blocking result includes:
determining the target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data;
taking the part of the constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameters of the input data, to obtain the blocking result;
wherein j is a positive integer such that, after the target size of the part of the constant data is determined, the data volume of the part of the constant data, together with the data volume of the input data blocks determined based on the minimum granularity of the dimension parameters of the input data, satisfies the memory requirement of the computing device.
In a possible implementation, in a case where the specified dimension is one dimension and the dimension parameters include the first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain a one-dimensional blocking result includes:
determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size, and performing one-dimensional blocking on the target data along the first dimension based on the target block size, to obtain multiple target data blocks corresponding to the target data; wherein k is a positive integer;
in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as the updated target block size and returning to the step of performing one-dimensional blocking on the target data along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data no longer satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result;
in a case where the initial data and the multiple target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
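The one-dimensional blocking loop above grows the block size one granularity step at a time and keeps the largest k whose blocks still satisfy the set condition, failing if even k = 1 does not. A sketch with the blocking condition abstracted into a caller-supplied predicate (`fits` is an assumption standing in for the memory and DMA checks of the disclosure):

```python
def one_d_block(total_size, grain, fits):
    """Return the largest block size k*grain (k >= 1) satisfying `fits`,
    or None if even k == 1 fails (one-dimensional blocking failure)."""
    k, best = 1, None
    while k * grain <= total_size and fits(k * grain):
        best = k * grain        # this k works; try k + 1 next
        k += 1
    return best
```

For instance, blocking a 100-element first dimension at granularity 8 under a 30-element block budget settles on blocks of 24 elements (k = 3).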
In a possible implementation, in a case where the specified dimension is two dimensions and the dimension parameters include a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain a two-dimensional blocking result includes:
determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size, and performing one-dimensional blocking on the target data along the first dimension based on the first target block size, to obtain multiple intermediate data blocks corresponding to the target data; wherein y is a positive integer;
determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size, and performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, to obtain multiple target data blocks respectively corresponding to the intermediate data blocks; wherein x is a positive integer;
in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as the updated second target block size and returning to the step of performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data no longer satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
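Two-dimensional blocking nests the same search: with the first-dimension block fixed at y granularity steps, the second-dimension multiplier x is grown until the condition fails. A sketch under the same abstraction as the one-dimensional case (`fits` again stands in for the memory and DMA checks):

```python
def two_d_block(y, grains, fits):
    """grains = (first-dim granularity, second-dim granularity).
    Fix the first-dimension block at y*grains[0]; return the largest
    (height, width) block satisfying `fits`, or None on failure."""
    height = y * grains[0]
    x, best = 1, None
    while fits((height, x * grains[1])):
        best = (height, x * grains[1])   # this x works; try x + 1 next
        x += 1
    return best
```

With granularities (4, 8), y = 2, and a 200-element block budget, the search settles on 8x24 blocks before 8x32 exceeds the budget.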
In a possible implementation, in a case where the parameter data corresponding to the network layer to be processed further includes output data, determining that the multiple target data blocks and the initial data satisfy the set blocking condition includes:
in a case where it is determined that the initial data, the output data, and each target data block respectively satisfy the memory requirement of the computing device, and that the initial data, the output data, and each target data block respectively satisfy the DMA transfer requirement of the computing device, determining that the multiple target data blocks and the initial data satisfy the set blocking condition.
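The blocking condition of this implementation checks each piece of data independently against both constraints. A minimal sketch (the two predicate parameters are assumptions abstracting the device's memory requirement and DMA transfer requirement):

```python
def blocks_satisfy(initial_bytes, output_bytes, block_bytes, fits_memory, fits_dma):
    """True iff the initial data, the output data, and every target data
    block each satisfy both the memory and the DMA transfer requirement."""
    pieces = [initial_bytes, output_bytes, *block_bytes]
    return all(fits_memory(p) and fits_dma(p) for p in pieces)
```

A predicate of this shape is what the one- and two-dimensional blocking searches above would call on every candidate block size.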
In some embodiments, the functions of, or the modules included in, the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. FIG. 6 is a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, which includes a processor 601, a memory 602, and a bus 603. The memory 602 is configured to store execution instructions and includes an internal memory 6021 and an external memory 6022; the internal memory 6021 is used to temporarily store operation data in the processor 601 and data exchanged with the external memory 6022 such as a hard disk. The processor 601 exchanges data with the external memory 6022 through the internal memory 6021. When the electronic device 600 runs, the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
determining a network layer to be processed in a target neural network;
determining, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network;
running the target neural network including the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program; when the computer program is run by a processor, the steps of the neural network operation method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the neural network operation method described in the above method embodiments. For details, reference may be made to the above method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (18)

  1. A neural network operation method, comprising:
    determining a network layer to be processed in a target neural network;
    determining, from a plurality of determined operators and a plurality of block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein the plurality of operators are used to implement a function corresponding to the network layer to be processed, and the plurality of block strategies match operation requirements of a computing device used to run the target neural network; and
    running the target neural network comprising the target operator based on the target block strategy corresponding to the network layer to be processed.
  2. The method according to claim 1, wherein the block strategy is used to block parameter data of the target operator corresponding to the network layer to be processed; and
    among the plurality of block strategies, the target block strategy is the one under which, based on the parameter data obtained by blocking the parameter data of the target operator, running the network layer to be processed consumes the fewest resources.
  3. The method according to claim 1 or 2, wherein, in a case where there are a plurality of network layers to be processed, determining, from the plurality of determined operators and the plurality of block strategies, the target operator and the target block strategy corresponding to each network layer to be processed in the target neural network comprises:
    for any network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the plurality of operators, and determining, from the plurality of block strategies, a target candidate block strategy matching the target candidate operator; and
    in a case where any network layer to be processed corresponds to a plurality of target candidate operators and/or a plurality of target candidate block strategies, determining the target operator and the target block strategy corresponding to the network layer to be processed based on the target candidate operators and the target candidate block strategies corresponding to the network layer to be processed.
  4. The method according to claim 3, wherein determining the target operator and the target block strategy corresponding to the network layer to be processed based on the target candidate operators and the target candidate block strategies corresponding to the network layer to be processed comprises:
    determining a plurality of test networks corresponding to the target neural network based on the target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators, wherein any one of the plurality of test networks comprises one target candidate operator corresponding to the network layer to be processed and one target candidate block strategy matching that target candidate operator;
    running the plurality of test networks respectively to obtain a plurality of test results, wherein each of the plurality of test networks corresponds to one test result;
    selecting a target test network from the plurality of test networks based on the plurality of test results; and
    determining the target candidate operator and the target candidate block strategy of the network layer to be processed in the target test network as the target operator and the target block strategy corresponding to the network layer to be processed in the target neural network.
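The test-network selection above can be sketched as an exhaustive search over per-layer candidate pairs, scoring each full combination and keeping the best. This is an illustrative reading only, not the claimed implementation; the names `select_targets` and `run_test_network`, and the use of latency as the test result, are assumptions.

```python
# Illustrative sketch: enumerate candidate (operator, block_strategy) pairs per
# layer, run one test network per combination, and keep the combination with
# the best (here: lowest-latency) test result.
from itertools import product

def select_targets(layers, candidates, run_test_network):
    """candidates: layer name -> list of (operator, block_strategy) pairs.
    run_test_network: callable taking {layer: pair} and returning a latency."""
    combos = list(product(*(candidates[layer] for layer in layers)))
    # Each combination defines one test network; its test result is a latency.
    results = [(run_test_network(dict(zip(layers, c))), c) for c in combos]
    _, best_combo = min(results, key=lambda r: r[0])
    return dict(zip(layers, best_combo))
```

With several layers the combination count grows multiplicatively, which is why claims 5-7 prune candidates by estimated resource consumption before any test network is actually run.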
  5. The method according to claim 3 or 4, wherein, for any network layer to be processed in the target neural network, determining the target candidate operator corresponding to the network layer to be processed from the plurality of operators and determining the target candidate block strategy matching the target candidate operator from the plurality of block strategies comprises:
    determining, for the network layer to be processed, one or more first candidate operators from the plurality of operators; and
    selecting, based on resource consumption of each first candidate operator under each of the plurality of block strategies, one or more target candidate operators corresponding to the network layer to be processed and target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies.
  6. The method according to claim 5, wherein the resource consumption is represented by a computation cost value, and the computation cost value of a first candidate operator under each block strategy is determined according to the following steps:
    determining a restricted scenario corresponding to the first candidate operator at a preset size, wherein the restricted scenario is determined based on the computation time and the transfer time of the data volume corresponding to the first candidate operator at the preset size;
    in a case where the restricted scenario is a bandwidth-restricted scenario, determining, based on a blocking result obtained by blocking according to the block strategy, a total amount of direct memory access (DMA) data transfer, a number of DMA tasks, and a data conversion overhead corresponding to the first candidate operator under the block strategy; and determining the computation cost value of the first candidate operator under the block strategy based on the total amount of DMA data transfer, the number of DMA tasks, the data conversion overhead, and the DMA rate and per-task DMA overhead of the computing device, wherein the data conversion overhead is the time consumed to convert the data layout of the input data corresponding to the first candidate operator into the target data layout corresponding to the first candidate operator; and
    in a case where the restricted scenario is a computation-restricted scenario, determining, based on the blocking result obtained by blocking according to the block strategy, the computation time of the parameter data corresponding to the first candidate operator under the block strategy, the number of operator invocations of the first candidate operator, a total amount of initial data transfer, the number of DMA tasks, and the data conversion overhead; and determining the computation cost value of the first candidate operator under the block strategy based on the computation time, the number of operator invocations, the total amount of initial data transfer, the data conversion overhead, the per-task DMA overhead, the number of DMA tasks, and the DMA rate of the computing device.
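Claim 6 lists the inputs of the two cost formulas but not their exact combination. The sketch below is one plausible reading (transfer time = bytes / rate, plus per-task and conversion overheads; compute cost adds per-invocation compute time); the function and parameter names are hypothetical, not from the patent.

```python
# Hedged sketch of the two cost-value formulas implied by claim 6.
def cost_bandwidth_limited(dma_bytes_total, dma_task_count, convert_overhead,
                           dma_rate, dma_task_overhead):
    # Transfer-dominated: total DMA bytes over the DMA rate, plus a fixed
    # per-task overhead and the layout-conversion time for the input data.
    return (dma_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)

def cost_compute_limited(compute_time, call_count, init_bytes,
                         dma_task_count, convert_overhead,
                         dma_rate, dma_task_overhead):
    # Compute-dominated: per-invocation compute time, plus the initial data
    # transfer and the same DMA and conversion overheads.
    return (compute_time * call_count
            + init_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)
```

Which formula applies is decided first by comparing the computation time against the transfer time at the preset size, as the claim's "restricted scenario" step describes.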
  7. The method according to claim 5 or 6, wherein selecting, based on the resource consumption of each first candidate operator under each of the plurality of block strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies comprises:
    selecting, from a plurality of resource consumption records corresponding to the first candidate operators, target resource consumption records satisfying a preset condition, wherein one first candidate operator under one block strategy corresponds to one resource consumption record;
    determining the block strategy corresponding to a target resource consumption record as a candidate block strategy, running, based on the candidate block strategy, the network layer to be processed comprising the second candidate operator corresponding to the target resource consumption record, and determining a test result corresponding to the candidate block strategy and the second candidate operator; and
    determining, based on the test result, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators.
  8. The method according to any one of claims 5 to 7, wherein, before selecting the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies, the method further comprises:
    performing an alignment operation on the parameter data corresponding to the first candidate operator based on determined minimum granularity information corresponding to the target neural network, to obtain aligned parameter data corresponding to the first candidate operator,
    wherein the minimum granularity information comprises the minimum granularity of the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information.
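The alignment step of claim 8 amounts to rounding each dimension up to the nearest multiple of that dimension's minimum granularity. A minimal sketch, with illustrative granularity values (the patent does not fix concrete numbers):

```python
# Sketch of claim 8's alignment: pad each dimension of the parameter data up
# to an integer multiple of that dimension's minimum granularity.
def align_shape(shape, min_granularity):
    def round_up(size, gran):
        # Ceiling division, then scale back: smallest multiple of gran >= size.
        return ((size + gran - 1) // gran) * gran
    return tuple(round_up(s, g) for s, g in zip(shape, min_granularity))
```

For example, with per-dimension granularities (16, 32), a (30, 70) tensor would be aligned to (32, 96); a dimension already on a multiple is left unchanged.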
  9. The method according to any one of claims 1 to 8, wherein, in a case where the parameter data comprises input data and constant data, the plurality of block strategies comprise at least one of the following:
    taking all of the input data as initial data and, based on a determined dimension parameter of the constant data, blocking the constant data in a specified dimension to obtain a blocking result, wherein the initial data is the data written into the initial data region allocated to a direct memory access (DMA) task when the computing device runs the target neural network;
    taking all of the constant data as the initial data and, based on a determined dimension parameter of the input data, blocking the input data in a specified dimension to obtain a blocking result;
    taking part of the input data as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in a specified dimension to obtain a blocking result, wherein a target size of the part of the input data is determined according to the minimum granularity of a first dimension of the input data; and
    taking part of the constant data as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in a specified dimension to obtain a blocking result, wherein a target size of the part of the constant data is determined according to the minimum granularity of a first dimension of the constant data.
  10. The method according to claim 9, wherein taking part of the input data as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in the specified dimension to obtain the blocking result comprises:
    determining the target size of the part of the input data based on i times the minimum granularity of the first dimension of the input data; and
    taking the part of the input data of the target size as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in the specified dimension to obtain the blocking result,
    wherein i is a positive integer such that, after the target size of the part of the input data is determined, the data volume of the part of the input data, together with the data volume of the constant data block determined based on the minimum granularity of the dimension parameter of the constant data, satisfies the memory requirements of the computing device.
  11. The method according to claim 9, wherein taking part of the constant data as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in the specified dimension to obtain the blocking result comprises:
    determining the target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data; and
    taking the part of the constant data of the target size as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in the specified dimension to obtain the blocking result,
    wherein j is a positive integer such that, after the target size of the part of the constant data is determined, the data volume of the part of the constant data, together with the data volume of the input data block determined based on the minimum granularity of the dimension parameter of the input data, satisfies the memory requirements of the computing device.
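Claims 10 and 11 share one pattern: grow the "initial" slice in whole multiples of the first dimension's minimum granularity while the slice plus one block of the other tensor still fits in device memory. The sketch below is one way to pick the largest such multiplier; how the patented method actually searches for i or j is not stated, and all names and the byte-accounting model are assumptions.

```python
# Hedged sketch: largest positive multiplier m such that a slice of
# m * min_gran units (at bytes_per_unit each), plus one block of the other
# tensor, still fits within the device memory limit.
def largest_multiplier(min_gran, bytes_per_unit, other_block_bytes, mem_limit):
    m = 0
    while (m + 1) * min_gran * bytes_per_unit + other_block_bytes <= mem_limit:
        m += 1
    return m  # 0 means even the smallest slice does not fit
```

A return value of 0 would correspond to no feasible i (or j), i.e. this strategy family is not applicable for the layer.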
  12. The method according to any one of claims 9 to 11, wherein, in a case where the specified dimension is one-dimensional and the dimension parameter comprises the first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain the blocking result comprises:
    determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size and, based on the target block size, blocking the target data one-dimensionally along the first dimension to obtain a plurality of target data blocks corresponding to the target data, wherein k is a positive integer;
    in a case where it is determined that the plurality of target data blocks and the initial data satisfy a set blocking condition, taking (k+1) times the minimum granularity corresponding to the first dimension of the target data as an updated target block size and returning to the step of blocking the target data one-dimensionally along the first dimension based on the target block size, until it is determined that the plurality of target data blocks and the initial data do not satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result; and
    in a case where the initial data and the plurality of target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining the blocking result to be a one-dimensional blocking failure.
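The iterative search in claim 12 can be sketched as follows: try block sizes k*g for k = 1, 2, ..., keep the last feasible size, and report failure if even k = 1 is infeasible. The `fits` predicate stands in for the memory and DMA checks of claim 14 and is an assumption, as is the function name.

```python
# Sketch of claim 12's one-dimensional block-size search. g is the minimum
# granularity of the first dimension; fits(blocks) represents the set
# blocking condition (memory and DMA-transfer requirements).
def search_block_size_1d(dim_size, g, fits):
    best = None
    k = 1
    while k * g <= dim_size:
        # Split the first dimension into chunks of k*g (last one may be short).
        blocks = [min(k * g, dim_size - off) for off in range(0, dim_size, k * g)]
        if not fits(blocks):
            break          # the previous k, if any, was the last feasible size
        best = k * g
        k += 1
    return best            # None signals a one-dimensional blocking failure
```

Note the direction of the search: the claim grows k until the condition fails, so the result is the largest granularity-aligned block size that still satisfies the device constraints.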
  13. The method according to any one of claims 9 to 12, wherein, in a case where the specified dimension is two-dimensional and the dimension parameter comprises a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain the blocking result comprises:
    determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size and, based on the first target block size, blocking the target data one-dimensionally along the first dimension to obtain a plurality of intermediate data blocks corresponding to the target data, wherein y is a positive integer;
    determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size and, based on the second target block size, blocking at least one intermediate data block two-dimensionally along the second dimension to obtain a plurality of target data blocks corresponding to each intermediate data block, wherein x is a positive integer; and
    in a case where it is determined that the plurality of target data blocks and the initial data satisfy the set blocking condition, taking (x+1) times the minimum granularity corresponding to the second dimension of the target data as an updated second target block size and returning to the step of blocking the at least one intermediate data block two-dimensionally along the second dimension based on the second target block size, until it is determined that the plurality of target data blocks and the initial data do not satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
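The two-dimensional variant fixes the first-dimension block at y*g1 and then repeats the claim-12 search along the second dimension. A hedged sketch under the same assumptions (the `fits` predicate and all names are illustrative):

```python
# Sketch of claim 13's two-dimensional block-size search: the first dimension
# is cut into blocks of y*g1, then the second-dimension block x*g2 is grown
# while the blocking condition still holds.
def search_block_size_2d(shape, g1, g2, y, fits):
    h = y * g1              # fixed first-dimension block size
    best = None
    x = 1
    while x * g2 <= shape[1]:
        if not fits(h, x * g2):
            break           # the previous x, if any, was the last feasible size
        best = (h, x * g2)
        x += 1
    return best             # None: no feasible 2-D block at this y
```

In a full implementation one would presumably also iterate over y, falling back to two-dimensional blocking only when the one-dimensional search of claim 12 fails; the claims leave that outer control flow to the implementation.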
  14. The method according to claim 12 or 13, wherein, in a case where the parameter data corresponding to the network layer to be processed comprises output data, determining that the plurality of target data blocks and the initial data satisfy the set blocking condition comprises:
    determining that the plurality of target data blocks and the initial data satisfy the set blocking condition in a case where it is determined that the initial data, the output data, and at least one target data block each satisfy the memory requirements of the computing device, and that the initial data, the output data, and the at least one target data block each satisfy the DMA transfer requirements of the computing device.
  15. A neural network operation apparatus, comprising:
    a first determination module, configured to determine a network layer to be processed in a target neural network;
    a second determination module, configured to determine, from a plurality of determined operators and a plurality of block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein the plurality of operators are used to implement a function corresponding to the network layer to be processed, and the plurality of block strategies match operation requirements of a computing device used to run the target neural network; and
    a running module, configured to run the target neural network comprising the target operator based on the target block strategy corresponding to the network layer to be processed.
  16. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the neural network operation method according to any one of claims 1 to 14.
  17. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the steps of the neural network operation method according to any one of claims 1 to 14.
  18. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the method according to any one of claims 1 to 14.
PCT/CN2021/086229 2020-12-31 2021-04-09 Neural network operation method and apparatus, electronic device, and storage medium WO2022141924A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020227010736A KR20220098341A (en) 2020-12-31 2021-04-09 Neural network operation method, apparatus, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011619783.3 2020-12-31
CN202011619783.3A CN112668701B (en) 2020-12-31 2020-12-31 Neural network operation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022141924A1 true WO2022141924A1 (en) 2022-07-07

Family

ID=75412062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086229 WO2022141924A1 (en) 2020-12-31 2021-04-09 Neural network operation method and apparatus, electronic device, and storage medium

Country Status (3)

Country Link
KR (1) KR20220098341A (en)
CN (1) CN112668701B (en)
WO (1) WO2022141924A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023150912A1 (en) * 2022-02-08 2023-08-17 华为技术有限公司 Operator scheduling operation time comparison method and device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130185233A1 (en) * 2012-01-16 2013-07-18 Samsung Electronics Co., Ltd. System and method for learning pose classifier based on distributed learning architecture
CN110348562A (en) * 2019-06-19 2019-10-18 北京迈格威科技有限公司 The quantization strategy of neural network determines method, image-recognizing method and device
CN110796652A (en) * 2019-10-30 2020-02-14 上海联影智能医疗科技有限公司 Image processing method, computer device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599900B (en) * 2015-10-20 2020-04-21 华中科技大学 Method and device for recognizing character strings in image
CN110717905B (en) * 2019-09-30 2022-07-05 上海联影智能医疗科技有限公司 Brain image detection method, computer device, and storage medium
CN111179231B (en) * 2019-12-20 2024-05-28 上海联影智能医疗科技有限公司 Image processing method, device, equipment and storage medium
CN111179372B (en) * 2019-12-31 2024-03-26 上海联影智能医疗科技有限公司 Image attenuation correction method, image attenuation correction device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN112668701A (en) 2021-04-16
KR20220098341A (en) 2022-07-12
CN112668701B (en) 2023-12-22


Legal Events

- ENP — Entry into the national phase: Ref document number 2022519516; Country of ref document: JP; Kind code of ref document: A
- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application: Ref document number 21912689; Country of ref document: EP; Kind code of ref document: A1
- NENP — Non-entry into the national phase: Ref country code: DE
- 32PN — EP: public notification in the EP bulletin as the address of the addressee cannot be established: Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/11/2023)