WO2022141924A1 - Neural network operation method and apparatus, electronic device, and storage medium - Google Patents


Info

Publication number
WO2022141924A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
block
dimension
candidate
Prior art date
Application number
PCT/CN2021/086229
Other languages
French (fr)
Chinese (zh)
Inventor
徐磊 (XU Lei)
Original Assignee
上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司 (Shanghai SenseTime Intelligent Technology Co., Ltd.)
Priority application KR1020227010736A, published as KR20220098341A
Publication of WO2022141924A1

Classifications

    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06F 11/3037: Monitoring arrangements where the monitored computing system component is a memory, e.g. virtual memory, cache
    • G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the multiple block strategies include at least one of the following: using all of the input data as initial data and, based on the determined dimension parameter of the constant data, dividing the constant data into blocks of a specified dimension to obtain a block result, where the initial data is placed in the initial data area allocated for the direct memory access (DMA) task when the computing device runs the target neural network; using all of the constant data as initial data and, based on the determined dimension parameter of the input data, dividing the input data into blocks of the specified dimension to obtain a block result; or using part of the input data as initial data and, based on the determined dimension parameter of the constant data, dividing the constant data into blocks of the specified dimension to obtain a block result, where the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
  • FIG. 4 shows a schematic diagram of software and hardware scheduling of a computing device in a method for running a neural network provided by an embodiment of the present disclosure
  • the official inference library provided by the manufacturer of the computing device can be used to run a large-scale neural network on the computing device, but the official inference library targets specific basic neural networks.
  • for an optimized neural network, the official inference library may be unavailable, or the computing device may run the optimized neural network less efficiently when using the official inference library.
  • the official inference library is an available inference deployment solution.
  • for example, the official inference library can be the cdnn library of the vertex dsp.
  • multiple operators and/or multiple block strategies corresponding to each network layer to be processed may be determined based on historical experience data. For example, based on historical experience data, it can be determined that the operators corresponding to network layer 1 to be processed include operator 1, operator 2, and operator 3, with corresponding block strategies 1, 2, and 4; and that the operators corresponding to network layer 2 to be processed include operator 1, operator 3, operator 4, and operator 5, with corresponding block strategies 2 and 5.
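The per-layer candidate bookkeeping described above might be represented as in the following sketch; the layer names, operator numbers, and strategy numbers are hypothetical and simply mirror the example in the text.

```python
# Hypothetical record of per-layer candidate operators and block strategies,
# as drawn from historical experience data in the text's example.
candidates = {
    "layer_1": {"operators": [1, 2, 3], "block_strategies": [1, 2, 4]},
    "layer_2": {"operators": [1, 3, 4, 5], "block_strategies": [2, 5]},
}

def candidate_pairs(layer):
    """Every (operator, block strategy) combination to be evaluated for a layer."""
    c = candidates[layer]
    return [(op, s) for op in c["operators"] for s in c["block_strategies"]]

print(len(candidate_pairs("layer_1")))  # 3 operators x 3 strategies = 9
```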
  • one or more first candidate operators corresponding to a network layer to be processed may be determined from the plurality of operators. For example, according to the task of each network layer to be processed, an operator capable of completing that task can be selected from the plurality of operators as a first candidate operator corresponding to the network layer; alternatively, the first candidate operator corresponding to the network layer to be processed can be determined according to the requirements of the neural network.
  • At least one target candidate operator and target candidate block strategy corresponding to the network layer to be processed can be determined by using the calculation cost value in the following two ways:
  • the block strategy 1 is determined as the target candidate block strategy corresponding to the network layer 1 to be processed.
  • Step 1: select a target resource consumption situation that satisfies a preset condition from the plurality of resource consumption situations corresponding to the first candidate operators, where one first candidate operator corresponds to one resource consumption situation under one block strategy.
  • Step 2: determine the block strategy corresponding to the target resource consumption situation as a candidate block strategy and, based on the candidate block strategy, run the network layer to be processed that includes the second candidate operator corresponding to the target resource consumption situation, to determine the test result for the candidate block strategy.
  • the second candidate operator and the candidate block strategy corresponding to each target overhead value may be actually measured to obtain the test result corresponding to that target overhead value. That is, for each target overhead value, the block strategy corresponding to the target resource consumption can be determined as a candidate block strategy and, based on that candidate block strategy, the network layer to be processed that includes the second candidate operator corresponding to the target overhead value can be run to determine the test result corresponding to the target overhead value, i.e., the test result corresponding to the candidate block strategy and the second candidate operator.
  • one or more target candidate operators corresponding to the network layer to be processed, and the target candidate block strategy corresponding to each target candidate operator, may be determined based on the test results. For example, when the test result is the running time, the second candidate operator with the shortest running time can be selected as the target candidate operator of the network layer to be processed, and the candidate block strategy corresponding to that second candidate operator can be determined as the target candidate block strategy.
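Selecting the target candidate operator and block strategy from measured running times reduces to taking a minimum over the test results. The data below is hypothetical, just to show the shape of the selection.

```python
# Hypothetical measured results: (second candidate operator, candidate block
# strategy, running time in milliseconds). The shortest running time wins.
test_results = [
    ("op_1", "strategy_1", 4.2),
    ("op_2", "strategy_3", 2.9),
    ("op_3", "strategy_2", 3.7),
]

# The operator with the shortest running time becomes the target candidate
# operator; its matching strategy becomes the target candidate block strategy.
best_op, best_strategy, best_time = min(test_results, key=lambda r: r[2])
print(best_op, best_strategy)  # op_2 strategy_3
```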
  • the first candidate operator and the second candidate operator may be operators capable of realizing the function of the network layer to be processed.
  • the resource consumption is represented by a calculation cost value
  • the calculation cost value of the first candidate operator under each block strategy can be determined according to the following steps:
  • Step 1: determine the restricted scenario corresponding to the first candidate operator at the preset size, where the restricted scenario is determined based on the calculation time and the transmission time associated with the data capacity corresponding to the first candidate operator at the preset size;
  • Step 3: when the restricted scenario is a computation-limited scenario, use the result of blocking based on the block strategy to determine the calculation time of the parameter data corresponding to the first candidate operator under the block strategy, and then determine the calculation cost value of the first candidate operator under the block strategy according to the number of DMA tasks and the DMA rate corresponding to the computing device.
  • the data capacity that can be transmitted by DMA in the target time is related to the transmission speed
  • the target data capacity is related to the calculation speed.
  • when the ratio is greater than 1, the transmission speed is greater than the calculation speed (that is, the transmission time is less than the calculation time), which is a computation-limited scenario; when the ratio is less than or equal to 1, the transmission speed is less than or equal to the calculation speed (that is, the transmission time is greater than or equal to the calculation time), which is a bandwidth-limited scenario. Different ways of determining the calculation cost value can then be chosen for the different restricted scenarios.
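The ratio test above can be sketched as a small classifier; all names and numbers here are assumptions, only the ratio rule comes from the text.

```python
def restricted_scenario(dma_rate_bytes_per_s, compute_time_s, target_capacity_bytes):
    """Classify per the text: compare the data capacity DMA can transfer
    within the computation time against the operator's target data capacity.
    ratio > 1  -> transfer is faster than compute -> computation-limited
    ratio <= 1 -> bandwidth-limited
    """
    transferable = dma_rate_bytes_per_s * compute_time_s
    ratio = transferable / target_capacity_bytes
    return "computation-limited" if ratio > 1 else "bandwidth-limited"

# e.g. 1 GB/s DMA, 2 ms of compute, 1 MB of operator data:
# 2 MB transferable vs 1 MB needed -> computation-limited
print(restricted_scenario(1e9, 0.002, 1e6))
```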
  • the target data capacity of the parameter data corresponding to the first candidate operator may be determined based on the preset size information of the parameter data of the first candidate operator.
  • the target data capacity may be the sum of constant data (including weight data and bias data), output data and input data.
  • the restricted scenario can then be determined based on the ratio of the data capacity that can be transmitted by the DMA within the calculated target time to the target data capacity.
  • the DMA task overhead corresponding to the computing device can be determined, in seconds (s).
  • the cycles required to create a DMA task can be converted into time, i.e., the DMA task overhead; and the DMA rate, i.e., the DMA transfer rate, can be determined in bytes per second (Byte/s).
  • the calculation cost value of the first candidate operator under the block strategy may be determined by using the first cost function.
  • the total amount of DMA data transmission can be determined from the generated DMA tasks, and the number of DMA tasks can be determined from the number of data blocks obtained after the parameter data is divided based on the block strategy. For example, when one data block corresponds to one DMA task and 10 data blocks are generated, it is determined that there are 10 DMA tasks.
  • the total amount of DMA data transmission and the number of DMA tasks may be determined according to actual conditions, and this is only an exemplary description.
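A possible shape for the first cost function, consistent with the quantities named above (total DMA transmission, DMA rate, DMA task count, per-task creation overhead). The disclosure does not give the exact formula, so this is an assumed sketch with hypothetical names.

```python
def dma_task_overhead_s(cycles_to_create_task, clock_hz):
    # Convert the cycles needed to create one DMA task into seconds,
    # as the text describes for the DMA task overhead.
    return cycles_to_create_task / clock_hz

def calculation_cost(total_dma_bytes, dma_rate_bytes_per_s,
                     num_dma_tasks, task_overhead_s):
    # Assumed cost model: pure transfer time plus per-task creation overhead.
    return total_dma_bytes / dma_rate_bytes_per_s + num_dma_tasks * task_overhead_s

# 10 data blocks -> 10 DMA tasks, mirroring the text's example; a 1 GHz
# device taking 2000 cycles per task gives 2 microseconds of overhead each.
overhead = dma_task_overhead_s(2000, 1e9)
cost = calculation_cost(10 * 2**20, 2**30, 10, overhead)  # 10 MiB at 1 GiB/s
```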
  • when the first candidate operator is a convolution operator, the number of DMA tasks obtained from the block result can be determined according to convolution parameters such as the convolution kernel size and convolution stride corresponding to the convolution operator.
  • the operator overhead conversion bandwidth is the amount of operator transmission data determined based on the calculation time of the first candidate operator at the preset size and the size of the parameter data corresponding to the first candidate operator under the block strategy. For example, when the preset size is 1024×1024×128, the calculation time of the first candidate operator at the preset size is 10 milliseconds, and the size of the blocked parameter data is 512×512×64, the calculation time of the parameter data corresponding to the first candidate operator under the block strategy is 1.25 milliseconds. Then, based on the determined calculation speed and this calculation time (for example, 1.25 milliseconds), the operator overhead conversion bandwidth corresponding to the first candidate operator after blocking is determined.
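The worked example above can be checked numerically, under the assumption the text implies: compute time scales linearly with the element count of the parameter data.

```python
# Reproduce the text's example: 10 ms at the preset size 1024x1024x128,
# scaled linearly down to the blocked size 512x512x64.
preset = (1024, 1024, 128)
blocked = (512, 512, 64)
preset_time_ms = 10.0

def elements(shape):
    n = 1
    for d in shape:
        n *= d
    return n

# Each dimension is halved, so the element count drops by a factor of 8.
blocked_time_ms = preset_time_ms * elements(blocked) / elements(preset)
print(blocked_time_ms)  # 1.25
```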
  • in step 2 and step 3, based on the block result, determine the target data capacity corresponding to the aligned parameter data of the first candidate operator, the number of operator calls, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead.
  • an alignment operation is performed on the parameter data corresponding to the first candidate operator to obtain the aligned parameter data corresponding to the first candidate operator, where the minimum granularity information includes the minimum granularity corresponding to the parameter data in different dimensions, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information.
  • the specific process of the alignment operation can be selected according to actual needs, for example, conventional data alignment methods such as padding.
  • an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data corresponding to the first candidate operator.
  • the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information, which reduces the probability of parameter data being lost when running the target neural network based on the target block strategy.
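The alignment rule can be illustrated with a small padding-style helper (hypothetical names): ceiling division rounds each dimension up to a multiple of that dimension's minimum granularity.

```python
def align_shape(shape, min_granularity):
    """Round each dimension up to an integer multiple of that dimension's
    minimum granularity (a padding-style alignment, as the text suggests).
    -(-d // g) is ceiling division for positive integers."""
    return tuple(-(-d // g) * g for d, g in zip(shape, min_granularity))

# e.g. parameter data of 100x60x3 with per-dimension granularities 32, 16, 4
print(align_shape((100, 60, 3), (32, 16, 4)))  # (128, 64, 4)
```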
  • each test network includes, for each network layer to be processed, one target candidate operator corresponding to that layer and a target candidate block strategy matching that target candidate operator.
  • S2022: run the multiple test networks respectively to obtain multiple test results, where each test network corresponds to one test result.
  • the target neural network includes a first network layer to be processed, a second network layer to be processed and a third network layer to be processed
  • the first network layer to be processed includes target candidate operator 1 with its corresponding block strategy 1, and target candidate operator 2 with its corresponding block strategy 2;
  • the second network layer to be processed includes target candidate operator 3 and the block strategy corresponding to target candidate operator 3;
  • the third network layer to be processed includes target candidate operator 5 and block strategy 3 corresponding to target candidate operator 5.
  • the target candidate operator and target candidate block strategy of each network layer to be processed included in the target test network may be determined, respectively, as the target operator and target block strategy corresponding to each network layer to be processed in the target neural network.
  • each network layer to be processed can include one target operator matching a target block strategy, for example, target operator 1 matching target block strategy 1; alternatively, a network layer to be processed may include two target operators that match target block strategies.
  • for example, the two target operators can both be target operators that match target block strategy 1.
  • a threshold for the number of test networks corresponding to the target neural network may be set.
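Constructing test networks from per-layer candidate pairs, with a count threshold, might look like this sketch. The pair values follow the text's example layers; the strategy for operator 3 and `MAX_TEST_NETWORKS` are assumptions.

```python
import itertools

# One (target candidate operator, matching target candidate block strategy)
# pair is picked per layer; a test network is one pick per layer, and the
# number of test networks can be capped by a threshold.
per_layer_pairs = [
    [("op1", "strategy1"), ("op2", "strategy2")],  # first layer to be processed
    [("op3", "strategyA")],                        # second layer (strategy assumed)
    [("op5", "strategy3")],                        # third layer
]
MAX_TEST_NETWORKS = 16  # assumed threshold on the number of test networks

test_networks = list(itertools.islice(itertools.product(*per_layer_pairs),
                                      MAX_TEST_NETWORKS))
print(len(test_networks))  # 2 test networks from the example layers
```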
  • regarding the dimension parameter: when the specified dimension is one-dimensional, the dimension parameter is the first dimension; when the specified dimension is N-dimensional, the dimension parameter includes the first dimension through the Nth dimension, where N is greater than or equal to 2 and does not exceed the dimensionality of the constant data or of the input data.
  • the multiple chunking strategies include at least one of the following:
  • Scheme 2: using all input data as initial data and, based on the determined first dimension and second dimension of the constant data, performing two-dimensional blocking on the constant data to obtain a block result.
  • all of the input data can be used as initial data, and space for the initial data can be allocated in the initial data area. Then, based on the determined first dimension of the constant data, one-dimensional blocking is performed on the constant data to obtain a block result; alternatively, based on the determined first dimension and second dimension of the constant data, two-dimensional blocking is performed on the constant data to obtain a block result.
  • part of the input data can also be used as the initial data and, based on the determined first dimension of the input data, one-dimensional blocking is performed on the input data to obtain a block result; or, based on the determined first dimension and second dimension of the input data, two-dimensional blocking is performed on the input data to obtain a block result.
  • a part of the input data is used as initial data and, based on the determined dimension parameter of the constant data, the constant data is divided into blocks of a specified dimension to obtain a block result, including:
  • i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data, together with the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameter of the constant data, meets the memory requirements of the computing device.
  • the block result indicating that allocation of the constant data fails may mean that, after the constant data is divided according to the minimum granularity of the first dimension, the resulting constant data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, allocation fails when twice the data capacity of a constant data block divided according to the minimum granularity of the first dimension is larger than the memory of the scheduling area of the computing device.
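The ping-pong allocation test described above can be sketched as follows (names assumed): under ping-pong scheduling two buffers of the block are resident at once, so allocation fails when twice the block capacity exceeds the scheduling area.

```python
def ping_pong_allocation_ok(block_bytes, scheduling_area_bytes):
    """Under ping-pong scheduling two copies of the block buffer are live
    at once, so allocation fails when 2x the block capacity is larger than
    the scheduling area, as described in the text."""
    return 2 * block_bytes <= scheduling_area_bytes

print(ping_pong_allocation_ok(48 * 1024, 128 * 1024))  # True: 96 KiB fits
print(ping_pong_allocation_ok(80 * 1024, 128 * 1024))  # False: 160 KiB does not
```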
  • Method 1: determine 1× the minimum granularity of the first dimension of the input data as the target size of the partial input data, use the partial input data as the initial data, and, based on the determined first dimension of the constant data, perform one-dimensional blocking on the constant data to obtain a one-dimensional block result;
  • partial constant data of the target size is respectively used as the initial data and, based on the determined dimension parameter of the input data, the input data is divided into blocks of the specified dimension to obtain a block result.
  • Method 1: determine 1× the minimum granularity of the first dimension of the constant data as the target size of the partial constant data, use the partial constant data as the initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a one-dimensional block result;
  • Method 2: determine 2× the minimum granularity of the first dimension of the constant data as the target size of the partial constant data, use the partial constant data as the initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a one-dimensional block result;
  • the first dimension and the second dimension for blocking the input data can be set according to information such as operation requirements and/or operator types, and the first dimension and the second dimension for blocking the constant data can likewise be set according to information such as operation requirements and/or operator types.
  • when the operator is a convolution operator, the first dimension of the constant data may be the output channel (hereinafter referred to as OC) dimension, and the second dimension may be the input channel (hereinafter referred to as IC) dimension.
  • when the specified dimension is one-dimensional and the dimension parameter includes the first dimension, the constant data and the input data are respectively used as target data and, based on the determined first dimension of the target data, one-dimensional blocking is performed on the target data to obtain one-dimensional block results, including:
  • A2: when it is determined that the multiple target data blocks and the initial data meet the set block conditions, take (k+1)× the minimum granularity corresponding to the first dimension of the target data as the updated target block size, and return to dividing the target data into one-dimensional blocks along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set block conditions, at which point k× the minimum granularity corresponding to the first dimension of the target data is determined as the block result.
  • the target block size is continuously increased, and the block result that yields a higher memory usage rate on the computing device is determined through successive attempts, which helps reduce the waste of the computing device's memory resources.
  • in step A1, k is a positive integer.
  • the minimum granularity corresponding to the first dimension of the target data is determined as the target block size, and, according to the target block size, the target data is divided into one-dimensional blocks along the first dimension to obtain the multiple target data blocks corresponding to the target data.
  • the size of the first dimension of each resulting target data block is consistent with the target block size, and the size of each target data block in every dimension other than the first is consistent with the size of the corresponding dimension of the target data.
  • for example, when the target block size is 32, the target data is divided into one-dimensional blocks along the first dimension to obtain multiple target data blocks, and the size of each target data block can be 32×64×128.
  • the number of target data blocks may be determined according to actual conditions.
  • the first dimension can be set as required.
  • the first dimension of the input data can be the width W dimension
  • the second dimension can be the input channel IC dimension
  • the first dimension of the constant data can be the output channel OC dimension
  • the second dimension of the constant data can be the input channel IC dimension.
  • if, when the target data is divided into one-dimensional blocks along the first dimension at the minimum granularity corresponding to the first dimension of the target data, the multiple target data blocks and the initial data still do not meet the set block conditions, the block result is determined to be a one-dimensional blocking failure.
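Steps A1 and A2 describe a search that grows the one-dimensional block size in multiples of the minimum granularity until the block conditions fail. A minimal sketch, with the block-condition check left to the caller as a hypothetical `fits` predicate:

```python
def one_dim_block(first_dim_size, min_granularity, fits):
    """Steps A1/A2 sketched: grow the block size k * min_granularity while
    the resulting blocks (plus initial data) still satisfy the block
    conditions, as judged by the caller-supplied `fits` predicate.
    Returns the last k that fit, or None for a one-dimensional blocking
    failure (even the minimum granularity did not fit)."""
    k = 1
    best = None
    while k * min_granularity <= first_dim_size and fits(k * min_granularity):
        best = k
        k += 1
    return best

# Toy condition: blocks of at most 96 elements along the first dimension fit.
result = one_dim_block(256, 32, lambda size: size <= 96)
print(result)  # 3, i.e. a block size of 3 x 32 = 96
```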
  • when the specified dimension is two-dimensional and the dimension parameter includes the second dimension, the constant data and the input data are respectively used as the target data and, based on the determined first dimension and second dimension of the target data, two-dimensional blocking is performed on the target data to obtain block results, including:
  • B2: determine x× the minimum granularity corresponding to the second dimension of the target data as the second target block size; based on the second target block size, divide each intermediate data block into two-dimensional blocks along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer;
  • y is a positive integer with an initial value of 1. For example, when the maximum value of y is set to 3, y can be set to 1 and steps B1 to B3 executed to obtain a two-dimensional block result; then y is set to 2 and steps B1 to B3 executed to obtain a two-dimensional block result; and y is set to 3 and steps B1 to B3 executed to obtain a two-dimensional block result. That is, three two-dimensional block results can be obtained.
  • each intermediate data block is divided into two-dimensional blocks along the second dimension to obtain the multiple target data blocks corresponding to each intermediate data block, that is, multiple target data blocks are obtained, and the size of each target data block may be 32×32×256.
  • B3: it can be judged whether the multiple target data blocks and the initial data meet the set block conditions; if so, 2× (i.e., (x+1)×) the minimum granularity corresponding to the second dimension of the target data is used as the updated second target block size, and the process returns to dividing each intermediate data block into two-dimensional blocks along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set block conditions, at which point x× the minimum granularity corresponding to the second dimension of the target data is determined as the block result.
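Steps B1 to B3 fix the first-dimension block size at y times its minimum granularity and then grow the second-dimension block size until the block conditions fail. A sketch under the same assumptions as before (`fits` is a hypothetical caller-supplied predicate):

```python
def two_dim_block(min_g1, min_g2, second_dim_size, y, fits):
    """Steps B1-B3 sketched: the first-dimension block size is fixed at
    y * min_g1 (y is iterated externally, starting at 1), then the
    second-dimension block size x * min_g2 is grown while the resulting
    target data blocks still satisfy the block conditions.
    Returns (first_size, best_x); best_x is None on failure."""
    first_size = y * min_g1
    x, best = 1, None
    while x * min_g2 <= second_dim_size and fits(first_size, x * min_g2):
        best = x
        x += 1
    return first_size, best

# Toy condition: the block "area" along the two dimensions must stay <= 4096.
first, x = two_dim_block(32, 16, 256, 1, lambda a, b: a * b <= 4096)
print(first, x)  # 32 8, since 32 x (8 x 16) = 4096 still fits
```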
  • the minimum granularity of the first dimension can be used as the first target block size, and twice the minimum granularity of the second dimension can be used as the second target block size; based on the first target block size and the second target block size, two-dimensional blocking is performed on the target data corresponding to the target operator of the network layer to be processed.
  • when the parameter data corresponding to the network layer to be processed also includes output data, determining that the multiple target data blocks and the initial data meet the set block conditions includes: determining that the initial data, the output data, and each target data block respectively meet the memory requirements of the computing device and, when the initial data, the output data, and each target data block also respectively meet the DMA transfer requirements in the computing device, determining that the multiple target data blocks and the initial data satisfy the set block conditions.
  • the memory requirements of the computing device may be set according to user requirements and/or computing device requirements. For example, it can be determined whether the total data capacity of the initial data, output data, and each target data block is less than or equal to the set memory capacity of the computing device, and if so, it is determined to meet the memory requirements of the computing device.
  • determine whether the data capacity of the initial data is less than or equal to the first local memory capacity allocated for the initial data on the memory of the computing device, and determine whether the data capacity of the output data is less than or equal to the local memory capacity allocated for the output data on the memory of the computing device.
  • dedicated memory and public memory can also be set. If the constant data is set to be stored in the public memory, and the input data and output data are stored in the dedicated memory, it can be determined whether the initial data, the output data, and each target data block meet the memory requirements of the corresponding dedicated memory and public memory; if so, it is determined that the memory requirements of the computing device are met. That is, when the initial data is the input data and the target data blocks are the target data blocks corresponding to the constant data, it is judged whether the data capacity of the initial data and the output data is less than or equal to the set memory capacity of the dedicated memory, and whether each target data block is less than or equal to the set memory capacity of the public memory; if all conditions are satisfied, it is determined that the memory requirements of the computing device are satisfied.
  • after each target data block is determined, an allocation attempt is performed for the target data block, the initial data, and the output data. If the allocation attempt succeeds, it is determined that the initial data, the output data, and each target data block satisfy the memory requirements of the computing device.
  • the DMA transmission requirements can be determined according to actual needs. For example, if the total data capacity of the initial data, the output data, and each target data block is less than or equal to the data capacity that can be transferred by DMA, that is, when it is determined that the DMA task can be established successfully, it is determined that the DMA transfer requirements of the computing device are met.
  • when the initial data, the output data, and each target data block meet the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data satisfy the set block conditions.
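The set block conditions above (memory fit plus DMA transferability) can be summarized as a single predicate. The flat-capacity model and all names below are assumptions of this sketch, not the patent's concrete interface:

```python
def meets_block_conditions(initial_cap, output_cap, block_caps,
                           device_mem_cap, dma_max_cap):
    """Hypothetical check of the 'set block conditions': the initial data,
    the output data and all target data blocks must fit the device memory
    and be transferable by DMA (all capacities in bytes)."""
    total = initial_cap + output_cap + sum(block_caps)
    fits_memory = total <= device_mem_cap
    fits_dma = total <= dma_max_cap  # i.e. the DMA task can be established
    return fits_memory and fits_dma
```

A blocking scheduler would typically retry this predicate with progressively smaller target data blocks until it holds.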
  • after determining the target operator and the target block strategy corresponding to each network layer to be processed in the target neural network, the target neural network including the target operators can be run based on the target block strategies respectively corresponding to the network layers to be processed.
  • the image to be processed can be input into the target neural network, and the computing device uses the target block strategy and target operator corresponding to each network layer to be processed to perform feature extraction on the image to be processed and determine the detection result corresponding to the image.
  • the detection result may be the category of the target object included in the image to be processed, the position information of the target object, the contour information of the target object, and the like.
  • the memory of the computing device is divided into an initial data area, a scheduling data area ping, a scheduling data area pong, an output data area ping, and an output data area pong.
  • when the initial data is input data, the scheduling data is constant data; when the initial data is constant data, the scheduling data is input data.
  • an output ping (i.e., output data ping) is generated and placed in the memory area corresponding to the output data area ping of the computing device; the output ping is then read from that memory area through DMA and transmitted to the corresponding external memory (such as DDR).
  • the computing device then processes the received scheduling block while the DMA transmits the next scheduling block into the other scheduling data area (ping or pong), and the above process repeats until the parameter data of the network layer to be processed has been completely processed.
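The ping/pong schedule can be modeled sequentially as follows. This is a sketch only: on real hardware the DMA prefetch and the computation overlap in time, and `process` stands in for the device's per-block computation:

```python
def run_layer_ping_pong(schedule_blocks, process):
    """Sequential model of the ping/pong schedule: while the device
    processes the block in one buffer, the DMA fills the other buffer
    with the next scheduling block."""
    buffers = [None, None]          # scheduling data area ping / pong
    outputs = []
    if not schedule_blocks:
        return outputs
    buffers[0] = schedule_blocks[0]            # first DMA transfer-in
    for i in range(len(schedule_blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(schedule_blocks):       # prefetch next block (DMA)
            buffers[nxt] = schedule_blocks[i + 1]
        outputs.append(process(buffers[cur]))  # compute on current buffer
    return outputs                             # each output is DMA'd to DDR
```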
  • the writing order of the steps does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • an embodiment of the present disclosure also provides a neural network operating apparatus.
  • a schematic diagram of the architecture of the neural network operating apparatus provided by the embodiment of the present disclosure includes a first determining module 501, a second determining module 502, and a running module 503; specifically:
  • the first determination module 501 is used to determine the network layer to be processed in the target neural network
  • the second determination module 502 is configured to determine, from the determined multiple operators and multiple block strategies, the target operator and the target block strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple block strategies matches the operating requirements of the computing device used to run the target neural network;
  • a running module 503 is configured to run the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed.
  • the block strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed, such that the resource consumption of running the network layer to be processed on the blocked parameter data is minimal.
  • the second determination module 502, when determining, from the determined multiple operators and multiple block strategies, the target operator and target block strategy corresponding to the network layer to be processed in the target neural network, is used for:
  • for each network layer to be processed, determining a target candidate operator corresponding to the to-be-processed network layer from the plurality of operators, and determining, from the multiple block strategies, a target candidate block strategy matching the target candidate operator;
  • based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed, determining the target operator and the target block strategy corresponding to each network layer to be processed.
  • the second determination module 502, when determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operator and target candidate block strategy corresponding to each network layer to be processed, is used to:
  • determining a plurality of test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate block strategies corresponding to those target candidate operators; wherein each test network includes one target candidate operator corresponding to each of the network layers to be processed and one target candidate block strategy matching that target candidate operator;
  • determining the target candidate operator and target candidate block strategy of each network layer to be processed in the target test network as the target operator and the target block strategy corresponding to that network layer to be processed in the target neural network, respectively.
  • the second determination module 502, for each network layer to be processed in the target neural network, when determining the target candidate operator corresponding to the network layer to be processed from the plurality of operators and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies, is used for:
  • when the restricted scenario is a computation-limited scenario, determining, based on the block result under the block strategy, the computation time of the parameter data corresponding to the first candidate operator, the number of operator calls of the first candidate operator, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead; and determining the computational overhead value of the first candidate operator under the block strategy based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
  • the second determining module 502, when selecting, from the first candidate operators and the multiple block strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate block strategies corresponding to the target candidate operators, is used to:
  • selecting, from the multiple resource consumption situations corresponding to the first candidate operator, a target resource consumption situation that satisfies a preset condition; wherein one first candidate operator corresponds to one resource consumption situation under one block strategy;
  • determining one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators.
  • the alignment module 504 is configured to perform an alignment operation on the parameter data corresponding to the first candidate operator based on the determined minimum granularity information corresponding to the target neural network, to obtain the aligned parameter data corresponding to the first candidate operator;
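The alignment operation can be modeled as rounding each dimension of the parameter data up to a multiple of its minimum granularity. The helper names below are hypothetical, not the patent's interface:

```python
def align_up(size, min_granularity):
    """Round `size` up to the nearest multiple of the minimum granularity
    (a hypothetical model of the alignment performed by module 504)."""
    return -(-size // min_granularity) * min_granularity

def align_shape(shape, granularities):
    """Align every dimension of the parameter data's shape to the
    corresponding minimum granularity."""
    return tuple(align_up(s, g) for s, g in zip(shape, granularities))
```

For example, a 100-element dimension with a minimum granularity of 16 would be padded to 112 elements before blocking.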
  • the input data is divided into blocks of a specified dimension to obtain a block result;
  • the constant data is divided into blocks of a specified dimension to obtain a block result; wherein the target size of the part of the input data is determined according to the minimum granularity of the first dimension of the input data;
  • the input data is divided into blocks of a specified dimension to obtain a block result; wherein the target size of the part of the constant data is determined according to the minimum granularity of the first dimension of the constant data.
  • the constant data is divided into blocks of a specified dimension to obtain a block result, including:
  • each part of the input data of the target size is respectively used as initial data, and, based on the determined dimension parameter of the constant data, the constant data is divided into blocks of a specified dimension to obtain a block result;
  • i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data and the data capacity of the constant data block determined based on the minimum granularity of the dimension parameter of the constant data satisfy the memory requirement of the computing device.
  • j is a positive integer such that, after the target size of the partial constant data is determined, the data capacity of the partial constant data and the data capacity of the input data block determined based on the minimum granularity of the dimension parameter of the input data satisfy the memory requirement of the computing device.
  • in the case where the specified dimension is one-dimensional and the dimension parameter includes the first dimension,
  • the constant data and the input data are respectively used as target data, and one-dimensional blocking is performed on the target data based on the determined first dimension of the target data to obtain a one-dimensional block result, including:
  • when the block condition is satisfied, determining k times the minimum granularity corresponding to the first dimension of the target data as the block result;
  • otherwise, the block result is a one-dimensional block failure.
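The one-dimensional blocking described above can be sketched as a search for the largest multiple k of the minimum granularity whose block still satisfies the block condition; `block_ok` below stands in for the memory/DMA block-condition check and is an assumption of this sketch:

```python
def one_dim_block(dim_size, min_granularity, block_ok):
    """Try block sizes k * min_granularity (largest k first) along the
    first dimension of the target data; return the chosen block size, or
    None when one-dimensional blocking fails."""
    max_k = dim_size // min_granularity
    for k in range(max_k, 0, -1):
        size = k * min_granularity
        if block_ok(size):        # the set block conditions hold
            return size
    return None                   # one-dimensional block failure
```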
  • in the case where the specified dimension is two-dimensional and the dimension parameter includes a second dimension, the constant data and the input data are respectively used as target data, and two-dimensional blocking is performed on the target data based on the determined first dimension and second dimension of the target data to obtain a two-dimensional block result, including:
  • the functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure; the electronic device includes a processor 601, a memory 602, and a bus 603.
  • the memory 602 is used to store execution instructions and includes an internal memory 6021 and an external memory 6022; the internal memory 6021 is used to temporarily store operation data of the processor 601 and data exchanged with the external memory 6022 such as a hard disk; the processor 601 exchanges data with the external memory 6022 through the internal memory 6021.
  • the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
  • the target neural network including the target operator is run based on the target block strategy corresponding to the to-be-processed network layer.
  • an embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the neural network operation method described in the foregoing method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure further provide a computer program product, where the computer program product carries program codes, and the instructions included in the program codes can be used to execute the steps of the neural network operation method described in the foregoing method embodiments.
  • for details, refer to the foregoing method embodiments, which are not repeated here.
  • the above-mentioned computer program product can be specifically implemented by means of hardware, software or a combination thereof.
  • in one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and the like.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-executable non-volatile computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program codes, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Neurology (AREA)
  • Quality & Reliability (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Image Analysis (AREA)

Abstract

A neural network operation method and apparatus, an electronic device, and a storage medium. The method comprises: determining a network layer to be processed of a target neural network (S101); determining, from determined multiple operators and multiple chunking policies, a target operator and a target chunking policy respectively corresponding to the network layer to be processed of the target neural network (S102), each of the multiple operators being used for implementing a function corresponding to the network layer to be processed, and each of the multiple chunking policies matching an operating requirement of a computing device for operating the target neural network; and on the basis of the target chunking policy corresponding to the network layer to be processed, operating the target neural network comprising the target operator (S103).

Description

Neural network operation method, apparatus, electronic device and storage medium

TECHNICAL FIELD
The present disclosure relates to the technical field of deep learning, and in particular, to a neural network operation method and apparatus, an electronic device, and a storage medium.
BACKGROUND ART
With the development of technology, large-scale neural networks have been applied in various scenarios, such as autonomous driving and image recognition. After a large neural network is constructed, it can be run by a computing device.
SUMMARY OF THE INVENTION
In view of this, the present disclosure provides at least a neural network operation method and apparatus, an electronic device, and a storage medium.
In a first aspect, the present disclosure provides a neural network operation method, including: determining a network layer to be processed in a target neural network; determining, from determined multiple operators and multiple block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple block strategies matches the operating requirements of the computing device used to run the target neural network; and running the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed.
In the above method, after the network layer to be processed in the target neural network is determined, the target operator and the target block strategy corresponding to the network layer to be processed can be determined from the determined multiple operators and multiple block strategies. Since the block strategy meets the operating requirements of the computing device, running the target neural network including the target operator based on the target block strategy corresponding to the network layer to be processed can satisfy those operating requirements. At the same time, since the target block strategy blocks the parameter data of the target operator corresponding to the matching network layer to be processed, the resource consumption of running the network layer to be processed on the blocked parameter data is minimized; this resource consumption can be characterized, for example, by the total computational overhead. That is, while the operating requirements of the computing device are satisfied, the target neural network including the target operators can be run efficiently based on the target block strategies respectively corresponding to at least one network layer to be processed.
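The per-layer selection described above can be sketched as a small search over candidate (operator, block strategy) pairs. `meets_device` and `cost` below are assumed callbacks standing in for the device-requirement check and the resource-consumption estimate; all names are hypothetical:

```python
def select_per_layer(layers, operators, strategies, meets_device, cost):
    """For each to-be-processed layer, pick the (operator, block strategy)
    pair that satisfies the computing device's operating requirements and
    minimizes the resource-consumption estimate `cost`."""
    chosen = {}
    for layer in layers:
        candidates = [(op, st)
                      for op in operators[layer]
                      for st in strategies
                      if meets_device(layer, op, st)]
        if candidates:
            chosen[layer] = min(candidates,
                                key=lambda c: cost(layer, c[0], c[1]))
    return chosen
```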
In a possible implementation manner, the block strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed; among the multiple block strategies, running the network layer to be processed on the parameter data obtained by blocking the parameter data of the target operator with the target block strategy consumes the least resources.
In a possible implementation manner, when there are multiple network layers to be processed, determining the target operator and target block strategy corresponding to each network layer to be processed in the target neural network from the determined multiple operators and multiple block strategies includes: for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators, and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies; and, when any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate block strategies, determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed.
In the above embodiment, the target candidate operator corresponding to each network layer to be processed and the target candidate block strategy matching that operator can first be determined separately, realizing a local optimization of the target candidate operator and target candidate block strategy for each network layer to be processed. Further, when any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate block strategies, the target operator and target block strategy corresponding to each network layer to be processed are determined based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed, realizing a global optimization of the target candidate operators and target candidate block strategies across the network layers to be processed.
In a possible implementation manner, determining the target operator and the target block strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate block strategies respectively corresponding to the network layers to be processed includes: determining multiple test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate block strategies corresponding to those target candidate operators, wherein each test network includes one target candidate operator corresponding to each network layer to be processed and one target candidate block strategy matching that target candidate operator; running the multiple test networks respectively to obtain multiple test results, wherein each test network corresponds to one test result; selecting a target test network from the multiple test networks based on the multiple test results; and determining the target candidate operator and target candidate block strategy of each network layer to be processed in the target test network as the target operator and target block strategy corresponding to that network layer to be processed in the target neural network.
In the above embodiment, multiple test networks corresponding to the target neural network are determined based on the target candidate operator corresponding to each network layer to be processed and the target candidate block strategy corresponding to that operator; the computing device then runs the multiple test networks to determine the test result of each test network, and the target test network is determined based on the test results. For example, when the test result is the computational overhead, the test network with the least computational overhead can be selected as the target test network, and the target candidate operators and target candidate block strategies of the network layers to be processed in the target test network are determined as the target operators and target block strategies respectively corresponding to the network layers to be processed in the target neural network, realizing a global optimization of the target operators and target block strategies.
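The test-network enumeration can be sketched as a Cartesian product over per-layer candidates. `run_test_network` below is an assumed measurement callback (e.g. returning the total computational overhead of one test network, lower being better), and all names are hypothetical:

```python
from itertools import product

def pick_target_network(per_layer_candidates, run_test_network):
    """Enumerate test networks (one candidate (operator, strategy) pair
    per layer), measure each with `run_test_network`, and return the
    best per-layer assignment together with its score."""
    layers = list(per_layer_candidates)
    best, best_score = None, float("inf")
    for combo in product(*(per_layer_candidates[l] for l in layers)):
        assignment = dict(zip(layers, combo))   # one test network
        score = run_test_network(assignment)
        if score < best_score:
            best, best_score = assignment, score
    return best, best_score
```

Note the combinatorial cost: the number of test networks grows as the product of per-layer candidate counts, which is why the local pre-selection of candidates described above matters.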
In a possible implementation manner, for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators and determining a target candidate block strategy matching the target candidate operator from the multiple block strategies includes: for the network layer to be processed, determining one or more first candidate operators from the multiple operators; and, based on the resource consumption of each first candidate operator under each of the multiple block strategies, selecting, from the first candidate operators and the multiple block strategies, one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators.
Here, for each network layer to be processed, after the one or more first candidate operators corresponding to the network layer are determined, one or more target candidate operators corresponding to the network layer and the target candidate block strategies corresponding to those operators can be selected from the first candidate operators and the multiple block strategies based on the resource consumption of each first candidate operator under each block strategy. For example, the first candidate operator and block strategy with the least resource consumption can be selected as the target candidate operator and target candidate block strategy, realizing a local optimization of the target candidate operator and target candidate block strategy for each network layer to be processed.
In a possible implementation manner, the resource consumption is represented by a computational overhead value, and the computational overhead value of the first candidate operator under each block strategy is determined according to the following steps: determining the restricted scenario corresponding to the first candidate operator under a preset size, wherein the restricted scenario is determined based on the computation time and transmission time of the data capacity corresponding to the first candidate operator under the preset size; when the restricted scenario is a bandwidth-limited scenario, determining, based on the block result under the block strategy, the total amount of direct memory access (DMA) data transmission, the number of DMA tasks, and the data conversion overhead corresponding to the first candidate operator under the block strategy, and determining the computational overhead value of the first candidate operator under the block strategy based on the total amount of DMA data transmission, the number of DMA tasks, the data conversion overhead, and the DMA rate and DMA task overhead corresponding to the computing device, wherein the data conversion overhead is the time consumed to convert the data arrangement of the input data corresponding to the first candidate operator into the target data arrangement corresponding to the first candidate operator; and, when the restricted scenario is a computation-limited scenario, determining, based on the block result under the block strategy, the computation time of the parameter data corresponding to the first candidate operator under the block strategy, the number of operator calls of the first candidate operator, the total amount of initial data transmission, the number of DMA tasks, and the data conversion overhead, and determining the computational overhead value of the first candidate operator under the block strategy based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
In the above embodiment, the restricted scenario corresponding to the first candidate operator under the preset size can be determined, and different restricted scenarios correspond to different ways of determining the computational overhead value. For example, in a bandwidth-limited scenario, the overhead value can be determined based on the total amount of DMA data transmission, the number of DMA tasks, the data conversion overhead, the DMA rate, and the DMA task overhead; in a computation-limited scenario, the overhead value can be determined based on the computation time, the number of operator calls, the total amount of initial data transmission, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate.
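As a rough sketch, the two overhead formulas implied above might look as follows. The text lists the inputs but not how they are combined, so the additive model and all parameter names here are assumptions of this sketch:

```python
def overhead_bandwidth_limited(dma_bytes_total, dma_task_count,
                               convert_time, dma_rate, dma_task_overhead):
    """Bandwidth-limited scenario: overhead assumed dominated by DMA
    transfer time plus per-task setup cost and data-layout conversion."""
    return (dma_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_time)

def overhead_compute_limited(compute_time, call_count, init_bytes_total,
                             convert_time, dma_rate, dma_task_count,
                             dma_task_overhead):
    """Computation-limited scenario: per-call compute time assumed to
    dominate, with initial data transfer, DMA task setup and conversion
    time added on."""
    return (compute_time * call_count
            + init_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_time)
```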
In a possible implementation, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate blocking strategies corresponding to the target candidate operators, based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, includes: selecting, from the multiple resource consumption cases corresponding to the first candidate operators, a target resource consumption case that satisfies a preset condition, where one first candidate operator under one blocking strategy corresponds to one resource consumption case; determining the blocking strategy corresponding to the target resource consumption case as a candidate blocking strategy, and running, based on the candidate blocking strategy, the network layer to be processed containing the second candidate operator corresponding to the target resource consumption case, to determine a test result corresponding to the candidate blocking strategy and the second candidate operator; and determining, based on the test result, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
With the above method, the resource consumption can first be used to select, from the first candidate operators and the multiple blocking strategies, a second candidate operator and a candidate blocking strategy matching the second candidate operator; the second candidate operator and the candidate blocking strategy are then tested, and the test result is used to determine at least one target candidate operator and at least one target candidate blocking strategy corresponding to the network layer to be processed, so that the determined target candidate operators and target candidate blocking strategies are comparatively good choices.
In a possible implementation, before selecting, from the first candidate operators and the multiple blocking strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators, the method further includes: performing, based on determined minimum granularity information corresponding to the target neural network, an alignment operation on the parameter data corresponding to the first candidate operator, to obtain aligned parameter data corresponding to the first candidate operator, where the minimum granularity information includes the minimum granularity corresponding to the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity, indicated by the minimum granularity information, in the corresponding dimension.
Here, based on the minimum granularity information corresponding to the target neural network, an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data corresponding to that first candidate operator; the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity, indicated by the minimum granularity information, in the corresponding dimension, which reduces the probability of parameter data loss when the target neural network is subsequently run based on the target blocking strategy.
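The alignment step above can be sketched as follows: each dimension size is rounded up to the next integer multiple of that dimension's minimum granularity. This is a minimal sketch; the disclosure only requires that the aligned sizes be integer multiples, so the rounding-up (zero-padding) interpretation is an assumption.

```python
# Minimal sketch of the alignment operation: round the size of each dimension
# of the parameter data up to the next integer multiple of that dimension's
# minimum granularity, so that no block straddles a granularity boundary.

def aligned_shape(shape, min_granularity):
    """Round each dimension size up to a multiple of its minimum granularity."""
    return tuple(-(-size // gran) * gran          # ceiling division, then scale
                 for size, gran in zip(shape, min_granularity))

# Example: weight data of shape (3, 3, 250, 60) with per-dimension
# granularities (1, 1, 32, 16) is aligned to (3, 3, 256, 64).
shape = aligned_shape((3, 3, 250, 60), (1, 1, 32, 16))
```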
In a possible implementation, in a case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following: taking all of the input data as initial data, and blocking the constant data in a specified dimension based on a determined dimension parameter of the constant data, to obtain a blocking result, where the initial data is the data written into the initial data region allocated to direct memory access (DMA) tasks when the computing device runs the target neural network; taking all of the constant data as the initial data, and blocking the input data in a specified dimension based on a determined dimension parameter of the input data, to obtain a blocking result; taking part of the input data as the initial data, and blocking the constant data in a specified dimension based on the determined dimension parameter of the constant data, to obtain a blocking result, where the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data; and taking part of the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameter of the input data, to obtain a blocking result, where the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
In a possible implementation, taking part of the input data as the initial data and blocking the constant data in a specified dimension based on the determined dimension parameter of the constant data to obtain a blocking result includes: determining the target size of the partial input data based on i times the minimum granularity of the first dimension of the input data; and taking the partial input data of the target size as the initial data, and blocking the constant data in the specified dimension based on the determined dimension parameter of the constant data, to obtain the blocking result, where i is a positive integer such that, after the target size of the partial input data is determined, the data capacity of the partial input data, and the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameter of the constant data, satisfy the memory requirements of the computing device.
In a possible implementation, taking part of the constant data as the initial data and blocking the input data in a specified dimension based on the determined dimension parameter of the input data to obtain a blocking result includes: determining the target size of the partial constant data based on j times the minimum granularity of the first dimension of the constant data; and taking the partial constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameter of the input data, to obtain the blocking result, where j is a positive integer such that, after the target size of the partial constant data is determined, the data capacity of the partial constant data, and the data capacity of the input data blocks determined based on the minimum granularity of the dimension parameter of the input data, satisfy the memory requirements of the computing device.
Here, setting multiple blocking strategies enables a better target operator, and a target blocking strategy matching that target operator, to be selected for each network layer to be processed.
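For the "partial input data" and "partial constant data" strategies above, the multipliers i and j can, for example, be found by a simple search over positive integers. The sketch below is an assumption-laden illustration: the function name, the single memory budget, and the "largest i that fits" choice are not stated in the disclosure, which only requires i (or j) to be a positive integer whose resulting capacities satisfy the device's memory requirements.

```python
# Hypothetical sketch of choosing the multiplier i (or j): the target size of
# the partial data is i times the minimum granularity of its first dimension,
# and i must keep the partial data plus one block of the other tensor within
# the computing device's memory budget. All names are illustrative.

def choose_multiplier(granularity_bytes, other_block_bytes, mem_budget_bytes):
    """Largest positive i such that both buffers fit; None if i = 1 does not."""
    i, best = 1, None
    while i * granularity_bytes + other_block_bytes <= mem_budget_bytes:
        best = i
        i += 1
    return best
```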
In a possible implementation, in a case where the specified dimension is one-dimensional and the dimension parameter includes a first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain a blocking result includes: determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size, and performing, based on the target block size, one-dimensional blocking on the target data along the first dimension to obtain multiple target data blocks corresponding to the target data, where k is a positive integer; in a case where it is determined that the multiple target data blocks and the initial data satisfy a set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as an updated target block size, and returning to the step of performing one-dimensional blocking on the target data along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result; and, in a case where the initial data and the multiple target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
With the above method, the target block size is increased step by step, and the blocking result that yields higher memory utilization on the computing device is determined through repeated trials, which helps reduce the waste of the computing device's memory resources.
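The trial-and-error loop described above can be sketched as follows. The `fits` predicate stands in for the set blocking condition (the memory and DMA checks) and is an assumption for illustration; the loop itself follows the steps in this implementation: start at k = 1, keep enlarging the block size by one granularity step while the condition holds, and report failure if even k = 1 does not satisfy it.

```python
# Minimal sketch of the 1-D blocking search described above.

def split_1d(total, block):
    """Split a dimension of length `total` into chunks of at most `block`."""
    return [min(block, total - start) for start in range(0, total, block)]

def best_1d_block(total, min_gran, fits):
    """Largest k*min_gran whose blocks satisfy `fits`; None = blocking failed."""
    k, best = 1, None
    while True:
        blocks = split_1d(total, k * min_gran)
        if not fits(blocks):
            break                  # condition no longer holds
        best = k * min_gran        # remember the last size that worked
        k += 1
    return best

# Example: assume the device can hold at most 96 elements per block.
size = best_1d_block(256, 32, lambda blocks: max(blocks) <= 96)
```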
In a possible implementation, in a case where the specified dimension is two-dimensional and the dimension parameter includes a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain a blocking result includes: determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size, and performing, based on the first target block size, one-dimensional blocking on the target data along the first dimension to obtain multiple intermediate data blocks corresponding to the target data, where y is a positive integer; determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size, and performing, based on the second target block size, two-dimensional blocking on each intermediate data block along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer; and, in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as an updated second target block size, and returning to the step of performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
In a possible implementation, in a case where the parameter data corresponding to the network layer to be processed further includes output data, determining that the multiple target data blocks and the initial data satisfy the set blocking condition includes: determining that the multiple target data blocks and the initial data satisfy the set blocking condition in a case where it is determined that the initial data, the output data, and each target data block each satisfy the memory requirements of the computing device, and that the initial data, the output data, and each target data block each satisfy the DMA transfer requirements of the computing device.
With the above method, when the initial data, the output data, and each target data block satisfy the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, which ensures that the blocking strategy matches the operating requirements of the computing device.
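The blocking condition of this implementation can be sketched as a single predicate over the buffers involved. The two limit constants below are illustrative assumptions; the disclosure only requires that each buffer satisfy the device's memory requirement and its DMA transfer requirement.

```python
# Sketch of the set blocking condition: the initial data, the output data,
# and every target data block must each satisfy both the device memory
# requirement and the DMA transfer requirement. Limit values are assumed.

MEM_LIMIT_BYTES = 512 * 1024      # assumed per-buffer memory budget
DMA_MAX_BYTES = 256 * 1024        # assumed per-transfer DMA limit

def satisfies_blocking_condition(initial_bytes, output_bytes, block_sizes):
    buffers = [initial_bytes, output_bytes, *block_sizes]
    return all(b <= MEM_LIMIT_BYTES and b <= DMA_MAX_BYTES for b in buffers)
```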
For descriptions of the effects of the following apparatus, electronic device, and the like, reference may be made to the description of the above method, which is not repeated here.
In a second aspect, the present disclosure provides a neural network operation apparatus, including: a first determination module, configured to determine a network layer to be processed in a target neural network; a second determination module, configured to determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network, where each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network; and an operation module, configured to run the target neural network containing the target operator based on the target blocking strategy corresponding to each network layer to be processed.
In a third aspect, the present disclosure provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and when the machine-readable instructions are executed by the processor, the steps of the neural network operation method according to the first aspect or any one of its implementations are performed.
In a fourth aspect, the present disclosure provides a computer-readable storage medium storing a computer program that, when run by a processor, performs the steps of the neural network operation method according to the first aspect or any one of its implementations.
In a fifth aspect, the present disclosure provides a computer program including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device performs the above method.
To make the above objects, features, and advantages of the present disclosure more apparent and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
Description of the Drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the technical solutions of the present disclosure. It should be understood that the following drawings show only some embodiments of the present disclosure and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a neural network operation method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of determining the target operator and target blocking strategy corresponding to a network layer to be processed in a target neural network, in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of determining, from multiple operators, the target candidate operators corresponding to a network layer to be processed, and determining, from multiple blocking strategies, the target candidate blocking strategies matching the target candidate operators, in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the software and hardware scheduling of a computing device in a neural network operation method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic architecture diagram of a neural network operation apparatus provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present disclosure provided in the drawings is not intended to limit the claimed scope of the present disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
Generally, for a computing device that relies on direct memory access (DMA) for data transmission, the data cache of the computing device is inefficient or absent. Therefore, when such a computing device is used to run inference for a large neural network, problems such as tiling and scheduling of the single-layer tasks of the large neural network may be encountered due to the limited memory of the computing device.
In a specific implementation, when the computing device runs the large neural network, the official inference library provided by the manufacturer of the computing device can be used to run the large neural network on the computing device. However, the official inference library targets a specific base neural network; after a user optimizes the base neural network, the official inference library may become unusable, or the computing device may run the optimized neural network with the official inference library at low efficiency. The official inference library is an available inference deployment solution; for example, it may be the cdnn library of a ceva dsp.
Further, for the optimized neural network, secondary development of the official inference library can be carried out so that the developed inference library is applicable to the optimized neural network. However, the development process has high cost and low efficiency, and the developed inference library is applicable only to that optimized neural network and not to other neural networks, so the reuse rate of the developed inference library is low.
Therefore, to solve the above problems, embodiments of the present disclosure provide a neural network operation method and apparatus, an electronic device, and a storage medium.
The defects of the above solutions are results obtained by the inventor after practice and careful study. Therefore, the process of discovering the above problems, and the solutions to them proposed below in the present disclosure, should all be regarded as the inventor's contributions to the present disclosure in the course of this disclosure.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
To facilitate understanding of the embodiments of the present disclosure, a neural network operation method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the neural network operation method provided by the embodiments of the present disclosure is generally a computer device with certain computing capability; the computer device may be the computing device that runs the neural network, or another computing device, and includes, for example, a terminal device, a server, or another processing device. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the neural network operation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a schematic flowchart of a neural network operation method provided by an embodiment of the present disclosure, the method includes S101 to S103:
S101: Determine a network layer to be processed in a target neural network.
S102: Determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network.
Each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network.
S103: Run the target neural network containing the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In the above method, after the network layer to be processed in the target neural network is determined, the target operator and target blocking strategy corresponding to the network layer to be processed can be determined from the multiple determined operators and multiple blocking strategies. Because each blocking strategy satisfies the operating requirements of the computing device, running the target neural network containing the target operator based on the target blocking strategy corresponding to the network layer to be processed can satisfy the operating requirements of the computing device. At the same time, because the target blocking strategy can block the parameter data of the target operator corresponding to the matching network layer to be processed, running the network layer to be processed based on the blocked parameter data consumes the least resources; this resource consumption can be characterized, for example, by the total computation overhead. That is, while the operating requirements of the computing device are satisfied, running the target neural network containing the target operators based on the target blocking strategies respectively corresponding to at least one network layer to be processed is relatively efficient.
In a possible implementation, the target neural network is used to implement an image processing task, and the image processing task includes at least one of image recognition, image classification, image segmentation, and key point detection.
S101 to S103 are described in detail below.
Regarding S101: Here, the target neural network can be any neural network that has undergone graph-level optimization (that is, graph optimization). A graph-optimized neural network is a neural network whose computation graph has been determined, that is, a neural network in which the task and parameter sizes of each network layer have been determined; the parameter size of each network layer may be the size of the parameter data included in that network layer. For example, the task of the first network layer of the target neural network may be feature extraction, and when the first network layer includes input data, the parameter size of the input data may be 256×256×128. The task and parameter sizes of each network layer can be set according to the actual situation; this is only an exemplary description.
The network layer to be processed may be a network operation (OP) layer to be processed in the target neural network. For example, the network layer to be processed may be a network OP layer in the target neural network whose size is greater than a set threshold, or it may be a network OP layer selected by a user as required. The number of determined network layers to be processed may be one or more.
Exemplarily, each network OP layer can be approximated as a convolutional layer; for example, a fully connected layer can be approximated as a convolutional layer whose kernel size matches the feature map, and a conventional layer without weights can be treated as a convolutional layer whose weights are 0, and so on.
Regarding S102: Here, when there are multiple network layers to be processed, the target operator and target blocking strategy corresponding to each network layer to be processed can be determined for that layer. Each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network; each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each operator can correspond to one operation or one basic network structure unit. For example, the preset operators include a convolution operator, a pooling operator, a fully connected operator, and the like. The computing device is the device that directly processes the inference computation of the target neural network; for example, the computing device may be a digital signal processing (DSP) device or the like.
In a possible implementation, the blocking strategy is used to partition the parameter data of the target operator corresponding to the to-be-processed network layer; among the multiple blocking strategies, running the to-be-processed network layer on parameter data partitioned with the target blocking strategy consumes the fewest resources.
Here, the minimum resource consumption may refer to the minimum running time of the to-be-processed network layer. In specific implementation, the target blocking strategy of each to-be-processed network layer is used to partition the parameter data of the target operator corresponding to that layer, so that the computing device, running on the partitioned parameter data, consumes the fewest resources for each to-be-processed network layer. For example, the resource consumption may be characterized by the total computational cost, i.e., the total computational cost of running all to-be-processed network layers is minimized. The parameter data of an operator may include input/output data and constant data; the input/output data may include input data and output data, and the constant data may include weight data and/or bias data.
Exemplarily, the input data may be three-dimensional, e.g., including a width dimension, a height dimension, and an input-channel dimension; the output data may be three-dimensional, e.g., including an output-width dimension, an output-height dimension, and an output-channel dimension; the weight data may be four-dimensional, e.g., including a width dimension, a height dimension, an input-channel dimension, and an output-channel dimension; and the bias data may be one-dimensional, e.g., including an output-channel dimension. The dimension information of the above input data, output data, weight data, and bias data may be set according to the actual situation; the above is only an exemplary description.
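As a hedged illustration of the dimension layout above, the parameter-data shapes of a convolution operator could be written as follows; the concrete sizes are hypothetical placeholders, not values from this application.

```python
# Hypothetical sizes for a convolution operator's parameter data; only the
# dimension layout follows the description above, the numbers do not.
H, W, C_in = 32, 32, 16          # input: height, width, input channels
K_h, K_w, C_out = 3, 3, 64       # kernel height/width, output channels

input_shape = (H, W, C_in)               # three-dimensional input data
output_shape = (H, W, C_out)             # three-dimensional output data (spatial size kept for illustration)
weight_shape = (K_h, K_w, C_in, C_out)   # four-dimensional weight data
bias_shape = (C_out,)                    # one-dimensional bias data
```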
In specific implementation, when there are multiple to-be-processed network layers, the target operator and target candidate blocking strategy of each to-be-processed network layer may be determined layer by layer, following the order of the to-be-processed network layers in the target neural network; alternatively, the target operator and target candidate blocking strategy of each to-be-processed network layer may be determined in random order. For example, when it is necessary to determine whether the data arrangement of the input data of the current to-be-processed network layer is consistent with the set target data arrangement, the output data of the to-be-processed network layer preceding the current one is needed; in this case, the target candidate operator and target candidate blocking strategy corresponding to each to-be-processed network layer need to be determined layer by layer.
In an optional implementation, referring to FIG. 2, in the case where there are multiple to-be-processed network layers, determining, from the determined multiple operators and multiple blocking strategies, the target operator and target blocking strategy corresponding to a to-be-processed network layer in the target neural network includes:
S201: for each to-be-processed network layer in the target neural network, determining a target candidate operator corresponding to the to-be-processed network layer from the multiple operators, and determining a target candidate blocking strategy matching the target candidate operator from the multiple blocking strategies.
S202: in the case where any to-be-processed network layer corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, determining the target operator and target blocking strategy corresponding to each to-be-processed network layer based on the target candidate operators and target candidate blocking strategies respectively corresponding to the to-be-processed network layers.
Here, the target candidate operator corresponding to each to-be-processed network layer and the target candidate blocking strategy matching that operator may be determined first, realizing a local selection of the target candidate operator and target candidate blocking strategy for each to-be-processed network layer. Further, in the case where any to-be-processed network layer corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, all to-be-processed network layers are considered jointly to determine the target operator and target blocking strategy respectively corresponding to each to-be-processed network layer in the target neural network, realizing a global selection of the target operators and target blocking strategies of the to-be-processed network layers.
Here, an operator set and a blocking-strategy set may be preset; the operator set includes all the set operators, and the blocking-strategy set includes all the set blocking strategies. To improve the efficiency of determining the target operator and target blocking strategy of a to-be-processed network layer, for each to-be-processed network layer, the multiple operators and multiple blocking strategies corresponding to that layer may be determined from the operator set and the blocking-strategy set. The multiple operators corresponding to different to-be-processed network layers may be the same or different, and the multiple blocking strategies corresponding to different to-be-processed network layers may likewise be the same or different. The multiple operators and multiple blocking strategies corresponding to each to-be-processed network layer may be determined according to the actual situation.
Exemplarily, the multiple operators and/or multiple blocking strategies corresponding to each to-be-processed network layer may be determined based on historical experience data. For example, based on historical experience data, it may be determined that the operators corresponding to to-be-processed network layer one include operator one, operator two, and operator three, with corresponding blocking strategies including blocking strategy one, blocking strategy two, and blocking strategy four; and that the operators corresponding to to-be-processed network layer two include operator one, operator three, operator four, and operator five, with corresponding blocking strategies including blocking strategy two and blocking strategy five.
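The historical-experience example above amounts to a per-layer table of candidate sets; a minimal sketch, with all names being hypothetical placeholders:

```python
# Per-layer candidate operators and blocking strategies drawn from the preset
# operator set and blocking-strategy set; names are placeholders.
candidates = {
    "layer_1": {"operators": ["op_1", "op_2", "op_3"],
                "strategies": ["strategy_1", "strategy_2", "strategy_4"]},
    "layer_2": {"operators": ["op_1", "op_3", "op_4", "op_5"],
                "strategies": ["strategy_2", "strategy_5"]},
}

# Candidate sets of different layers may overlap but need not be identical.
shared_ops = set(candidates["layer_1"]["operators"]) & set(candidates["layer_2"]["operators"])
```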
In S201, in an optional implementation, referring to FIG. 3, for each to-be-processed network layer in the target neural network, determining the target candidate operator corresponding to the to-be-processed network layer from the multiple operators, and determining the target candidate blocking strategy matching the target candidate operator from the multiple blocking strategies, includes:
S301: for the to-be-processed network layer, determining one or more first candidate operators from the multiple operators.
S302: based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer and the target candidate blocking strategies corresponding to the target candidate operators.
Here, for each to-be-processed network layer, after the one or more first candidate operators corresponding to that layer are determined, one or more target candidate operators corresponding to the layer, and the target candidate blocking strategies corresponding to those operators, may be selected from the at least one first candidate operator and the multiple blocking strategies based on the resource consumption of each first candidate operator under each blocking strategy. For example, the first candidate operator and blocking strategy with the least resource consumption may be selected as the target candidate operator and target candidate blocking strategy, realizing a local selection of the target candidate operator and target candidate blocking strategy for each to-be-processed network layer.
For S301, for each to-be-processed network layer in the target neural network, one or more first candidate operators corresponding to that layer may be determined from the multiple operators. For example, according to the task of each to-be-processed network layer, operators capable of completing that task may be selected from the multiple operators as the first candidate operators corresponding to the layer; alternatively, the first candidate operators corresponding to the to-be-processed network layer may be determined by the user based on the requirements of the target neural network.
For S302, the resource consumption of each first candidate operator under each blocking strategy may be determined first; then, based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer, and the target candidate blocking strategies corresponding to those operators, are determined from the at least one candidate operator and the multiple blocking strategies. The resource consumption is the resource-consumption data observed when the computing device runs the first candidate operator based on the blocking strategy; for example, the resource consumption may be characterized by a computational cost value, which characterizes the time the computing device consumes to run the to-be-processed network layer containing the target operator.
Example one: if the first candidate operators corresponding to to-be-processed network layer one include first candidate operator one and first candidate operator two, and the blocking strategies corresponding to the layer include blocking strategy one, blocking strategy two, and blocking strategy three, then for first candidate operator one the computational cost values corresponding to blocking strategy one, blocking strategy two, and blocking strategy three can be calculated, and for first candidate operator two the computational cost values corresponding to blocking strategy one, blocking strategy two, and blocking strategy three can likewise be calculated; the target candidate operator and target candidate blocking strategy corresponding to to-be-processed network layer one can then be determined based on the six calculated cost values.
In an optional implementation, after the multiple computational cost values corresponding to each first candidate operator are obtained, the cost values may be used directly to determine at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer.
For example, at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer may be determined from the cost values in the following two ways:
In way one, from the calculated cost values, the first candidate operator and blocking strategy with the smallest cost value are selected as the target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer.
Continuing with example one above: after the six cost values are obtained, the smallest cost value is selected. For example, if first candidate operator one has the smallest cost value under blocking strategy one, first candidate operator one is determined as the target candidate operator corresponding to to-be-processed network layer one, and blocking strategy one is determined as the target candidate blocking strategy corresponding to to-be-processed network layer one.
In way two, a cost threshold may be set; from the multiple calculated cost values corresponding to the to-be-processed network layer, candidate cost values smaller than the cost threshold are selected, the first candidate operators corresponding to the candidate cost values are determined as the target candidate operators corresponding to the to-be-processed network layer, and the blocking strategies corresponding to the candidate cost values are determined as the target blocking strategies matching those target candidate operators.
Continuing with example one above: after the six cost values are obtained, if the cost value of first candidate operator one under blocking strategy one is smaller than the set cost threshold, and the cost value of first candidate operator two under blocking strategy three is smaller than the set cost threshold, then first candidate operator one is determined as a target candidate operator corresponding to to-be-processed network layer one, with blocking strategy one determined as the target candidate blocking strategy matching first candidate operator one; and first candidate operator two is determined as another target candidate operator corresponding to to-be-processed network layer one, with blocking strategy three determined as the target candidate blocking strategy matching first candidate operator two. The target candidate operators and target candidate blocking strategies corresponding to to-be-processed network layer one are thereby determined.
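The two selection ways above can be sketched as follows; the cost values for 2 candidate operators x 3 blocking strategies are hypothetical, and operator/strategy names are placeholders.

```python
# Hypothetical cost values for 2 candidate operators x 3 blocking strategies.
costs = {
    ("op_1", "strategy_1"): 10, ("op_1", "strategy_2"): 14, ("op_1", "strategy_3"): 18,
    ("op_2", "strategy_1"): 16, ("op_2", "strategy_2"): 20, ("op_2", "strategy_3"): 12,
}

# Way one: keep only the (operator, strategy) pair with the smallest cost.
best_pair = min(costs, key=costs.get)

# Way two: keep every pair whose cost is below a set cost threshold,
# which can yield several target candidates for one layer.
cost_threshold = 15
candidate_pairs = [pair for pair, cost in costs.items() if cost < cost_threshold]
```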
In another implementation, in S302, based on the resource consumption of each first candidate operator under each of the multiple blocking strategies, selecting, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the to-be-processed network layer and one or more target candidate blocking strategies corresponding to the target candidate operators includes:
Step one: selecting, from the multiple resource-consumption records corresponding to the first candidate operators, target resource-consumption records satisfying a preset condition; one first candidate operator corresponds to one resource-consumption record under one blocking strategy.
Step two: determining the blocking strategy corresponding to a target resource-consumption record as a candidate blocking strategy, and, based on the candidate blocking strategy, running the to-be-processed network layer containing the second candidate operator corresponding to that target resource-consumption record, to determine the test result corresponding to the candidate blocking strategy and the second candidate operator.
Step three: determining, based on the test results, one or more target candidate operators corresponding to the to-be-processed network layer and the target candidate blocking strategies corresponding to the target candidate operators.
With the above method, the resource-consumption records may first be used to select, from the first candidate operators and the multiple blocking strategies, second candidate operators and the candidate blocking strategies matching them; the second candidate operators and candidate blocking strategies are then tested, and the test results are used to determine at least one target candidate operator and target candidate blocking strategy corresponding to the to-be-processed network layer, so that the determined target candidate operators and target candidate blocking strategies are a better choice.
In step one, one first candidate operator corresponds to one resource-consumption record under one blocking strategy. For example, when a first candidate operator corresponds to four blocking strategies, that first candidate operator corresponds to four resource-consumption records.
The following description takes the case where resource consumption is characterized by a computational cost value as an example; the preset condition may be set according to actual needs. For example, the preset condition may be that the cost value is the smallest; and/or that the cost value is smaller than a set cost threshold; and/or that the smallest cost value is selected together with any second-smallest cost value whose difference from the smallest cost value is smaller than a set difference threshold.
For example, if the calculated cost values of a first candidate operator under the multiple set blocking strategies are: cost value one is 10, cost value two is 12, cost value three is 18, and cost value four is 20, then the smallest cost value may be selected from the multiple cost values, i.e., cost value one is determined as a target cost value; alternatively, a cost threshold of 15 may be set, and cost value one and cost value two are determined as target cost values; alternatively, a difference threshold of 5 may be set (the difference between cost value two and cost value one is smaller than the set difference threshold), and cost value one and cost value two are determined as target cost values. Each target cost value corresponds to a second candidate operator and the candidate blocking strategy matching that second candidate operator.
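The three preset conditions in the example above can be sketched directly over the cost values 10, 12, 18, 20:

```python
# Hypothetical cost values of one first candidate operator under four blocking strategies.
costs = [10, 12, 18, 20]

# Condition 1: the smallest cost value only.
by_min = [c for c in costs if c == min(costs)]

# Condition 2: every cost value below a set cost threshold.
cost_threshold = 15
by_threshold = [c for c in costs if c < cost_threshold]

# Condition 3: the smallest value, plus any value whose difference from the
# smallest is below a set difference threshold.
diff_threshold = 5
by_diff = [c for c in costs if c - min(costs) < diff_threshold]
```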
In step two, the second candidate operator and candidate blocking strategy corresponding to each target cost value (i.e., each target resource-consumption record) may be actually measured, obtaining the test result corresponding to each target cost value. That is, for each target cost value, the blocking strategy corresponding to the target resource-consumption record may be determined as a candidate blocking strategy, and, based on the candidate blocking strategy, the to-be-processed network layer containing the second candidate operator corresponding to that target cost value is run to determine the test result corresponding to that target cost value, i.e., the test result corresponding to the candidate blocking strategy and the second candidate operator.
In step three, one or more target candidate operators corresponding to the to-be-processed network layer, and the target candidate blocking strategies corresponding to those operators, may be determined based on the test results. For example, when the test result is a running time, the second candidate operator with the shortest running time may be selected and determined as the target candidate operator of the to-be-processed network layer, and the candidate blocking strategy corresponding to that second candidate operator determined as the target candidate blocking strategy. The first candidate operators and second candidate operators may be operators capable of realizing the function of the to-be-processed network layer.
Alternatively, a running-time threshold may be set; test results smaller than the running-time threshold are determined as target test results, the second candidate operators corresponding to the target test results are determined as target candidate operators, and the candidate blocking strategies corresponding to the target candidate operators in the target test results are determined as target candidate blocking strategies.
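Selection by measured running time, in both variants just described, can be sketched as follows; the (operator, strategy) pairs and millisecond timings are hypothetical.

```python
# Hypothetical measured running times (ms) for candidate (operator, strategy) pairs.
measured = {
    ("op_1", "strategy_1"): 8.0,
    ("op_2", "strategy_3"): 6.5,
    ("op_3", "strategy_2"): 9.2,
}

# Variant 1: the pair with the shortest measured running time wins.
target_pair = min(measured, key=measured.get)

# Variant 2: keep every pair measured below a running-time threshold.
time_threshold = 9.0
target_pairs = [p for p, t in measured.items() if t < time_threshold]
```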
In an optional implementation, when the resource consumption is represented by a computational cost value, the cost value of a first candidate operator under each blocking strategy may be determined according to the following steps:
Step one: determining the restricted scenario corresponding to the first candidate operator at a preset size, where the restricted scenario is determined based on the computation time and transmission time of the data capacity corresponding to the first candidate operator at the preset size.
Step two: in the case where the restricted scenario is a bandwidth-limited scenario, determining, based on the blocking result of partitioning with the blocking strategy, the total direct memory access (DMA) data-transfer amount, the number of DMA tasks, and the data-conversion overhead corresponding to the first candidate operator under the blocking strategy; and determining the cost value of the first candidate operator under the blocking strategy based on the total DMA data-transfer amount, the number of DMA tasks, the data-conversion overhead, and the DMA rate and per-task DMA overhead corresponding to the computing device. The data-conversion overhead is the time consumed to convert the data arrangement of the input data corresponding to the first candidate operator into the target data arrangement corresponding to that operator.
Step three: in the case where the restricted scenario is a computation-limited scenario, determining, based on the blocking result of partitioning with the blocking strategy, the computation time of the parameter data corresponding to the first candidate operator under the blocking strategy, the number of operator invocations of the first candidate operator, the total initial data-transfer amount, the number of DMA tasks, and the data-conversion overhead; and determining the cost value of the first candidate operator under the blocking strategy based on the computation time, the number of operator invocations, the total initial data-transfer amount, the data-conversion overhead, the per-task DMA overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
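The two steps above name the inputs of each cost value but not an exact formula; the linear combinations below are an assumed illustration of how such cost functions could be composed, not the patented computation.

```python
# Hedged sketch: assumed cost functions for the two restricted scenarios.

def bandwidth_limited_cost(dma_total_bytes, dma_task_count, conversion_overhead,
                           dma_rate, per_task_overhead):
    # Cost dominated by moving data: transfer time, per-task DMA setup
    # overhead, and time to convert input data to the target arrangement.
    return (dma_total_bytes / dma_rate
            + dma_task_count * per_task_overhead
            + conversion_overhead)

def computation_limited_cost(compute_time_per_call, call_count, initial_transfer_bytes,
                             conversion_overhead, per_task_overhead, dma_task_count, dma_rate):
    # Cost dominated by computing: total compute time across operator
    # invocations, plus the initial (non-overlapped) transfer, DMA task
    # setup overhead, and data-conversion overhead.
    return (compute_time_per_call * call_count
            + initial_transfer_bytes / dma_rate
            + dma_task_count * per_task_overhead
            + conversion_overhead)
```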
In step one, the restricted scenario corresponding to each first candidate operator at the preset size may be determined. The preset size may be a relatively large size set as required.
In specific implementation, the parameter data of the target operator corresponding to each to-be-processed network layer may be stored in external memory other than the memory of the computing device, e.g., in double-data-rate synchronous dynamic random-access memory (Double Data Rate, DDR). When each to-be-processed network layer is run, the DMA may fetch the parameter data (e.g., input data, constant data, etc.) of the target operator corresponding to that layer from the DDR and transfer the fetched parameter data into the memory of the computing device; after the computing device finishes the computation, the DMA transfers the data result (i.e., the output data) back to the DDR, so that the next network layer adjacent to this to-be-processed network layer in the target neural network (which may itself be a to-be-processed network layer) can use it. The DMA may use a ping-pong scheduling strategy to transfer the fetched parameter data.
It follows that there is a transmission time for the DMA to transfer the parameter data of the first candidate operator at the preset size, and a computation time for the computing device to process that parameter data. Accordingly, if the computation time is longer than the transmission time, then by the time the DMA has transferred the current parameter data into the memory of the computing device, the computing device has not yet finished processing the previous parameter data; the current parameter data can only be processed after that processing ends, and the scenario corresponding to this situation is a computation-limited scenario. If the computation time is shorter than the transmission time, then after the computing device finishes processing the previous parameter data, the DMA has not yet transferred the current parameter data into the computing device's memory, and the computing device needs to wait until the current parameter data transmitted by the DMA is received; the scenario corresponding to this situation is a bandwidth-limited scenario.
Accordingly, when the restricted scenario is a bandwidth-limited scenario, the cost value may be calculated using the first cost function corresponding to the bandwidth-limited scenario; when the restricted scenario is a computation-limited scenario, the cost value may be calculated using the second cost function corresponding to the computation-limited scenario.
Exemplarily, the restricted scenario corresponding to the first candidate operator at the preset size may be determined according to the following process: for the first candidate operator at the preset size, determining the transmission time required to transfer the parameter data corresponding to the first candidate operator at the preset size, and the computation time required for the computing device to process that parameter data, and determining the restricted scenario corresponding to the first candidate operator according to the relative magnitudes of the transmission time and the computation time.
Exemplarily, the restricted scenario of the first candidate operator at the preset size may also be determined according to the following process. First, based on the preset size information, determining the target time required for the computing device to run the corresponding to-be-processed network layer on the parameter data corresponding to the first candidate operator, and determining the target data capacity of that parameter data. Second, based on the DMA rate corresponding to the computing device and the target time, determining the data capacity that the DMA can transfer within the target time. Third, determining the restricted scenario based on the ratio of the data capacity the DMA can transfer within the target time to the target data capacity: when the ratio is less than or equal to 1, the restricted scenario is determined to be a bandwidth-limited scenario; when the ratio is greater than 1, the restricted scenario is determined to be a computation-limited scenario.
这里,目标耗时内DMA可传输的数据容量与传输速度相关,目标数据容量与计算速度相关,在比值大于1时,表征传输速度大于计算速度(即传输耗时小于计算耗时),即为计算受限场景;在比值小于或等于1时,表征传输速度小于或等于计算速度(即传输耗时大于或等于计算耗时),即为带宽受限场景,进而后续针对不同的受限场景,可以选择不同的方式确定计算开销值。Here, the data capacity that can be transmitted by DMA in the target time is related to the transmission speed, and the target data capacity is related to the calculation speed. When the ratio is greater than 1, it indicates that the transmission speed is greater than the calculation speed (that is, the transmission time is less than the calculation time), which is Computation-limited scenarios; when the ratio is less than or equal to 1, it indicates that the transmission speed is less than or equal to the calculation speed (that is, the transmission time is greater than or equal to the calculation time), which is a bandwidth-limited scenario, and then for different restricted scenarios, You can choose different ways to determine the computational overhead value.
The target time corresponding to the parameter data of the first candidate operator on the computing device can be determined based on the preset size information of that parameter data, i.e., the target time required for the computing device to run the corresponding network layer to be processed using the parameter data corresponding to the first candidate operator. The DMA rate of the computing device can then be multiplied by the target time to obtain the data capacity that the DMA can transfer within the target time.
Meanwhile, the target data capacity of the parameter data corresponding to the first candidate operator can be determined based on the preset size information of that parameter data. For example, when the first candidate operator is a convolution operator, the target data capacity may be the sum of the constant data (including weight data and bias data), the output data, and the input data. The restricted scenario can then be determined based on the ratio of the computed DMA-transferable capacity within the target time to the target data capacity.
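As a minimal sketch of the ratio-based classification described above (the function and parameter names below are illustrative, not from the patent):

```python
def classify_scenario(target_time_s, dma_rate_bytes_per_s, target_capacity_bytes):
    """Classify the restricted scenario of an operator at a preset size.

    target_time_s: time the computing device needs to run the layer.
    dma_rate_bytes_per_s: DMA transfer rate of the computing device.
    target_capacity_bytes: total parameter data (e.g. constant + input + output).
    """
    # Data capacity the DMA can transfer within the target time.
    transferable = dma_rate_bytes_per_s * target_time_s
    ratio = transferable / target_capacity_bytes
    # ratio <= 1: transfer is the bottleneck -> bandwidth-limited;
    # ratio >  1: computation is the bottleneck -> computation-limited.
    return "bandwidth-limited" if ratio <= 1 else "computation-limited"
```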
In specific implementation, after the computing device is determined, the DMA task overhead of that device, in seconds (s), can be determined; for example, the number of cycles consumed to create each DMA task can be converted into a time, which gives the DMA task overhead. The DMA rate, i.e., the DMA transfer rate in Byte/s, can also be determined.
In step 2, the computation overhead value of the first candidate operator under the blocking strategy can be determined using a first overhead function. The first overhead function may be: computation overhead = total DMA data transfer / DMA rate + number of DMA tasks × DMA task overhead + data conversion overhead.
That is, when a bandwidth-limited scenario is determined, the total DMA data transfer (in Byte), the number of DMA tasks, and the data conversion overhead (in seconds) of the first candidate operator under the blocking strategy can be determined based on the blocking result. The total DMA data transfer can be determined from the generated DMA tasks; the number of DMA tasks can be determined from the number of data blocks obtained after the parameter data is partitioned under the blocking strategy. For example, if each data block corresponds to one DMA task and 10 data blocks are generated, it is determined that there are 10 DMA tasks. Here, the total DMA data transfer and the number of DMA tasks can be determined according to the actual situation; this is only an exemplary description. For example, when the first candidate operator is a convolution operator, the number of DMA tasks obtained after blocking can be determined from convolution parameters such as the kernel size and stride of the convolution operator.
The data conversion overhead is the time consumed to convert the data layout of the input data of the first candidate operator into the target data layout corresponding to the first candidate operator. When the data layout of the input data of the first candidate operator is consistent with its target data layout, the data conversion overhead is 0; when they are inconsistent, the data conversion overhead can be calculated according to the following formula: data conversion overhead = total data capacity of the input data × 2 / DMA rate, where the total data capacity of the input data is all of the input data fed into the network layer to be processed before blocking.
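The first overhead function and the data conversion overhead above can be sketched as follows (a minimal illustration under the stated formulas; the names are ours, not the patent's):

```python
def data_conversion_overhead(input_bytes, dma_rate, layouts_match):
    """0 when the input layout already matches the operator's target layout;
    otherwise the input is moved twice, hence the factor of 2."""
    return 0.0 if layouts_match else input_bytes * 2 / dma_rate

def bandwidth_limited_cost(total_dma_bytes, dma_rate,
                           num_dma_tasks, dma_task_overhead_s,
                           conversion_overhead_s):
    """First overhead function (bandwidth-limited scenario):
    transfer time + per-task setup time + layout conversion time."""
    return (total_dma_bytes / dma_rate
            + num_dma_tasks * dma_task_overhead_s
            + conversion_overhead_s)
```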
In step 3, when a computation-limited scenario is determined, the computation overhead value of the first candidate operator under the blocking strategy can be calculated using a second overhead function. The second overhead function is: computation overhead = operator-overhead-converted bandwidth × number of operator invocations / DMA rate + total initial data transfer / DMA rate + number of DMA tasks × DMA task overhead + data conversion overhead.
The operator-overhead-converted bandwidth is the operator data transfer amount determined based on the computation time of the first candidate operator at the preset size and the size of the parameter data of the first candidate operator under the blocking strategy. For example, if the preset size is 1024×1024×128, the computation time of the first candidate operator at the preset size is 10 milliseconds, and the size of the blocked parameter data is 512×512×64, then the computation time of the parameter data of the first candidate operator under the blocking strategy is 1.25 milliseconds. The operator-overhead-converted bandwidth of the blocked first candidate operator can then be determined based on the determined computation speed and the computation time of the parameter data of the first candidate operator under the blocking strategy (e.g., 1.25 milliseconds).
Specifically, the computation time of the parameter data of the first candidate operator under the blocking strategy, the number of operator invocations of the first candidate operator, the total initial data transfer, the number of DMA tasks, and the data conversion overhead can be determined based on the blocking result. The number of operator invocations can be determined from the number of data blocks obtained after the parameter data is partitioned under the blocking strategy; for example, if 10 data blocks are obtained, the number of operator invocations is determined to be 10. The total initial data transfer is the data capacity of the initial data determined under the blocking strategy. The target data capacity, the number of operator invocations, and the total initial data transfer can be determined according to the actual situation.
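A sketch of the second overhead function, together with the per-block computation-time scaling from the worked example above. The names are illustrative, and the scaling assumes computation time is proportional to element count, as the 1024×1024×128 → 512×512×64 example implies:

```python
from math import prod

def blocked_compute_time(preset_time_s, preset_shape, block_shape):
    """Scale the computation time measured at the preset size down to one
    block, assuming time is proportional to the element count."""
    return preset_time_s * prod(block_shape) / prod(preset_shape)

def compute_limited_cost(converted_bandwidth_bytes, num_operator_calls,
                         dma_rate, initial_transfer_bytes,
                         num_dma_tasks, dma_task_overhead_s,
                         conversion_overhead_s):
    """Second overhead function (computation-limited scenario)."""
    return (converted_bandwidth_bytes * num_operator_calls / dma_rate
            + initial_transfer_bytes / dma_rate
            + num_dma_tasks * dma_task_overhead_s
            + conversion_overhead_s)

# Worked example from the text: 10 ms at 1024x1024x128, blocks of 512x512x64
# (an 8x reduction in elements) -> 1.25 ms per block.
t_block = blocked_compute_time(0.010, (1024, 1024, 128), (512, 512, 64))
```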
In steps 2 and 3, the target data capacity corresponding to the aligned parameter data of the first candidate operator, the number of operator invocations, the total initial data transfer, the number of DMA tasks, and the data conversion overhead can all be obtained based on the blocking result.
The data conversion overhead in step 3 is determined in the same way as in step 2 and is not described in detail again here. The embodiments of the present disclosure can mainly be applied to bandwidth-limited scenarios: when the bandwidth-limited scenario is satisfied, step 2 is used to determine the computation overhead value; when it is not satisfied (i.e., in the computation-limited scenario), step 3 can be used to determine the computation overhead value.
In the above implementations, the restricted scenario corresponding to the first candidate operator at the preset size can be determined, and different restricted scenarios correspond to different methods of determining the computation overhead value. For example, in a bandwidth-limited scenario, the computation overhead value can be determined based on the total DMA data transfer, the number of DMA tasks, the data conversion overhead, the DMA rate, and the DMA task overhead; in a computation-limited scenario, it can be determined based on the computation time, the number of operator invocations, the total initial data transfer, the data conversion overhead, and the DMA rate.
In an optional implementation, before the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators are selected from the first candidate operators and the multiple blocking strategies, the method further includes:
performing, based on the determined minimum granularity information corresponding to the target neural network, an alignment operation on the parameter data corresponding to the first candidate operator to obtain aligned parameter data corresponding to the first candidate operator; the minimum granularity information includes the minimum granularity of the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information.
Here, the minimum granularity information includes the minimum granularity of the parameter data in each dimension. For example, when the parameter data includes weight data, the minimum granularity information corresponding to the weight data includes the minimum granularity in the width dimension, the minimum granularity in the length dimension, the minimum granularity in the input-channel dimension, and the minimum granularity in the output-channel dimension. The minimum granularity information can be determined according to the operating requirements of the computing device and/or user requirements; this is only an exemplary description.
The determined minimum granularity information corresponding to the target neural network can be used to perform an alignment operation on the parameter data corresponding to each first candidate operator, obtaining aligned parameter data whose size in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information. For example, if the minimum granularity of the width dimension is 32 and the width of the parameter data is 33, the width of the generated aligned parameter data is 64; if the width of the parameter data is 31, the width of the aligned parameter data is 32.
The specific alignment procedure can be selected according to actual needs. For example, a conventional data alignment method (such as padding) can be used to align the parameter data and generate the aligned parameter data.
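The rounding-up rule in the alignment example above (33 → 64 and 31 → 32 with a granularity of 32) can be sketched as follows (illustrative names, not the patent's):

```python
def align_up(size, granularity):
    """Round a dimension size up to the nearest integer multiple of the
    minimum granularity of that dimension (ceiling division)."""
    return -(-size // granularity) * granularity

def align_shape(shape, min_granularity):
    """Align every dimension of the parameter data independently."""
    return tuple(align_up(s, g) for s, g in zip(shape, min_granularity))
```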
In another implementation, the computing device can also fetch the unaligned parameter data from DDR, compute with garbage padding data, then select the valid data from the data output by the computing device and write the valid data back to DDR as the output data.
Here, based on the minimum granularity information corresponding to the target neural network, an alignment operation can be performed on the parameter data corresponding to each first candidate operator to obtain the aligned parameter data, whose size in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information. This reduces the probability of parameter data loss when the target neural network is subsequently run under the target blocking strategy.
For S202: In an optional implementation, in S202, determining the target operator and target blocking strategy corresponding to each network layer to be processed, based on the target candidate operators and target candidate blocking strategies corresponding to the respective network layers to be processed, includes:
S2021: determining multiple test networks corresponding to the target neural network based on the target candidate operators corresponding to the respective network layers to be processed and the target candidate blocking strategies corresponding to the target candidate operators; each test network includes one target candidate operator for each network layer to be processed and one target candidate blocking strategy matching that target candidate operator.
S2022: running the multiple test networks separately to obtain multiple test results, where each test network corresponds to one test result.
S2023: selecting a target test network from the multiple test networks based on the multiple test results.
S2024: determining the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
In S2021, exemplarily, suppose the target neural network includes a first, a second, and a third network layer to be processed. The first network layer to be processed includes target candidate operator 1 with its corresponding blocking strategy 1, and target candidate operator 2 with its corresponding blocking strategy 2; the second network layer to be processed includes target candidate operator 3 with its corresponding blocking strategy 1, and target candidate operator 4 with its corresponding blocking strategy 1; the third network layer to be processed includes target candidate operator 5 with its corresponding blocking strategy 3.
Four test networks corresponding to the target neural network can then be obtained. The first test network includes: target candidate operator 1 with blocking strategy 1, target candidate operator 3 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The second test network includes: target candidate operator 1 with blocking strategy 1, target candidate operator 4 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The third test network includes: target candidate operator 2 with blocking strategy 2, target candidate operator 3 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3. The fourth test network includes: target candidate operator 2 with blocking strategy 2, target candidate operator 4 with blocking strategy 1, and target candidate operator 5 with blocking strategy 3.
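The enumeration above is a Cartesian product over the per-layer (operator, blocking strategy) candidates. A sketch reproducing the four-network example (the layer, operator, and strategy names are placeholders):

```python
from itertools import product

# One list of (operator, blocking strategy) pairs per network layer to be processed.
layer_candidates = [
    [("op1", "strategy1"), ("op2", "strategy2")],  # first layer
    [("op3", "strategy1"), ("op4", "strategy1")],  # second layer
    [("op5", "strategy3")],                        # third layer
]

# Each test network fixes one (operator, strategy) pair per layer.
test_networks = list(product(*layer_candidates))
```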
In S2022 and S2023, the computing device can be controlled to run the multiple test networks separately and determine the test result of each test network. For example, the test result can be the running time of each test network. The target test network can then be selected from the multiple test networks based on their test results; for example, the test network with the shortest running time can be selected as the target test network.
In S2024, the target candidate operators and target candidate blocking strategies of the network layers to be processed included in the target test network can be determined as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
For example, if the second test network is determined to be the target test network, then target candidate operator 1 is determined to be the target operator of the first network layer to be processed, and blocking strategy 1 is its target blocking strategy; target candidate operator 4 is the target operator of the second network layer to be processed, and blocking strategy 1 is its target blocking strategy; target candidate operator 5 is the target operator of the third network layer to be processed, and blocking strategy 3 is its target blocking strategy.
To reduce the cost and computing resources consumed by running the test networks and to improve the efficiency of determining the target operators and target blocking strategies, in specific implementation a maximum number of target operators matched with target blocking strategies can be set for each network layer to be processed. For example, when the maximum number is set to 2, each network layer to be processed may include one target operator matched with a target blocking strategy, e.g., target operator 1 matched with target blocking strategy 1. Alternatively, each network layer to be processed may include two target operators matched with target blocking strategies; for example, the two may be: target operator 1 matched with target blocking strategy 1 and target operator 1 matched with target blocking strategy 2; or target operator 1 matched with target blocking strategy 1 and target operator 2 matched with target blocking strategy 2; or target operator 1 matched with target blocking strategy 1 and target operator 2 matched with target blocking strategy 1; and so on.
And/or, in specific implementation, a threshold on the number of test networks corresponding to the target neural network can be set. For example, suppose the threshold is 100 and there are 10 network layers to be processed. If the first through sixth network layers to be processed each have 2 target operators matched with target blocking strategies, then across the first through sixth layers, the number of partial test networks formed from the target operators and target blocking strategies of each layer can be 2^6 = 64. Further, when determining the target operator and target blocking strategy of the seventh network layer to be processed, if the seventh layer also has 2 target operators matched with target blocking strategies, the number of partial test networks across the first through seventh layers would be 2^7 = 128, which exceeds the set threshold; in this case, the seventh, eighth, ninth, and tenth network layers to be processed may each have only 1 target operator matched with a target blocking strategy.
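The candidate cap and the network-count threshold can be combined in a simple greedy pass. The sketch below reproduces the 2^6 = 64 vs. 2^7 = 128 example; it is our illustration of the budgeting idea, not the patent's algorithm verbatim:

```python
def cap_candidates(per_layer_counts, max_networks):
    """Keep the running product of test networks within the threshold by
    dropping later layers to a single candidate once the budget is hit."""
    capped, total = [], 1
    for count in per_layer_counts:
        if total * count > max_networks:
            count = 1  # this layer keeps only one matched target operator
        capped.append(count)
        total *= count
    return capped, total

# 10 layers with 2 candidates each, threshold 100: layers 1-6 give 2**6 = 64,
# a 7th doubling would give 128 > 100, so layers 7-10 fall back to 1 candidate.
counts, total = cap_candidates([2] * 10, 100)
```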
In the above implementation, multiple test networks corresponding to the target neural network are determined based on the at least one target candidate operator corresponding to each network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators; the computing device then runs the multiple test networks, and the test result of each test network is determined. The target test network is determined based on the test results; for example, when the test result is the computation overhead, the test network with the smallest computation overhead can be selected as the target test network, and the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network are determined as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network. This achieves a global optimization of the target operators and target blocking strategies.
In an optional implementation, when the specified dimension is one-dimensional, the dimension parameter is the first dimension; when the specified dimension is N-dimensional, the dimension parameters include the first dimension through the Nth dimension, where N is greater than 2 and less than the dimensionality of the constant data or of the input data. In the case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following:
Scheme 1: take all of the input data as initial data, and perform one-dimensional blocking on the constant data based on the determined first dimension of the constant data to obtain a blocking result; the initial data is the data written into the initial data area allocated to the direct memory access (DMA) task when the computing device runs the target neural network.
Scheme 2: take all of the input data as initial data, and perform two-dimensional blocking on the constant data based on the determined first dimension and second dimension of the constant data to obtain a blocking result.
Scheme 3: take all of the constant data as initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a blocking result.
Scheme 4: take all of the constant data as initial data, and perform two-dimensional blocking on the input data based on the determined first dimension and second dimension of the input data to obtain a blocking result.
Scheme 5: take part of the input data as initial data, and perform one-dimensional blocking on the constant data based on the determined first dimension of the constant data to obtain a blocking result; the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
Scheme 6: take part of the input data as initial data, and perform two-dimensional blocking on the constant data based on the determined first dimension and second dimension of the constant data to obtain a blocking result; the target size of the partial input data is determined according to the minimum granularity of the first dimension of the input data.
Scheme 7: take part of the constant data as initial data, and perform one-dimensional blocking on the input data based on the determined first dimension of the input data to obtain a blocking result; the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
Scheme 8: take part of the constant data as initial data, and perform two-dimensional blocking on the input data based on the determined first dimension and second dimension of the input data to obtain a blocking result; the target size of the partial constant data is determined according to the minimum granularity of the first dimension of the constant data.
Here, all of the input data can be used as initial data, with the space for the initial data allocated in the initial data area; one-dimensional blocking is then performed on the constant data based on its determined first dimension, or two-dimensional blocking is performed on the constant data based on its determined first and second dimensions, to obtain a blocking result.
All of the constant data can be used as initial data, and one-dimensional blocking performed on the input data based on its determined first dimension, or two-dimensional blocking performed on the input data based on its determined first and second dimensions, to obtain a blocking result.
Part of the input data can also be used as initial data, and one-dimensional blocking performed on the input data based on its determined first dimension, or two-dimensional blocking performed on the input data based on its determined first and second dimensions, to obtain a blocking result.
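A minimal sketch of the one- and two-dimensional blocking used by schemes 1 through 8, splitting a tensor along its first one or two dimensions (function names and block sizes are illustrative):

```python
def split_1d(size, block):
    """One-dimensional blocking: the block sizes along a single dimension,
    with a possibly smaller tail block."""
    return [min(block, size - start) for start in range(0, size, block)]

def split_2d(shape, blocks):
    """Two-dimensional blocking over the first and second dimensions."""
    return [(h, w)
            for h in split_1d(shape[0], blocks[0])
            for w in split_1d(shape[1], blocks[1])]
```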
In an optional implementation, in scheme 5 or scheme 6, taking part of the input data as initial data and blocking the constant data in the specified dimension(s) based on the determined dimension parameters of the constant data to obtain a blocking result includes:
1. Determining the target size of the partial input data based on i times the minimum granularity of the first dimension of the input data.
2. Taking the partial input data of the target size as initial data, and blocking the constant data in the specified dimension(s) based on the determined dimension parameters of the constant data, to obtain a blocking result.
Here, i is a positive integer such that, once the target size of the partial input data is determined, the data capacity of the partial input data, together with the data capacity of the constant data blocks determined based on the minimum granularity of the dimension parameters of the constant data, satisfies the memory requirements of the computing device.
Here, the maximum value of i can be determined incrementally. The following takes scheme five (i.e., one-dimensional blocking) as an example. i is incremented starting from 1: when i=1, the target size of the part of the input data is 1 times the minimum granularity of the first dimension of the input data; the part of the input data of that target size is used as the initial data, and the constant data is blocked in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result.
When the one-dimensional blocking result corresponding to i=1 indicates that the allocate of the constant data failed, scheme five is unavailable. When the one-dimensional blocking result corresponding to i=1 indicates that the allocate of the constant data succeeded, i is incremented by 1 (giving i=2), and the process returns to the step of determining the target size of the part of the input data; that is, the target size becomes twice the minimum granularity of the first dimension of the input data, the part of the input data of that target size is used as the initial data, and the constant data is blocked in one dimension based on the determined first dimension of the constant data to obtain a one-dimensional blocking result. When the one-dimensional blocking result corresponding to i=2 indicates that the allocate of the constant data failed, the maximum value of i is determined to be 1 and the incrementing ends; when the one-dimensional blocking result indicates that the allocate succeeded, i is incremented by 1 (so that i=3), and the process again returns to the step of determining the target size of the part of the input data, until a one-dimensional blocking result indicates that the allocate of the constant data failed. For example, if the one-dimensional blocking result at i=6 indicates that the allocate of the constant data failed, the maximum value of i is determined to be 5. When the maximum value of i is 5, this scheme can yield 5 blocking results.
A blocking result indicating that the allocate of the constant data failed may mean that, after the constant data is divided according to the minimum granularity of the first dimension, the resulting constant data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, the allocate of the constant data fails when twice the data capacity of a constant data block divided according to the minimum granularity of the first dimension exceeds the memory of the scheduling area of the computing device.
For example, when the maximum value of i is 5, scheme five may include the following 5 blocking strategies:
Mode 1: determining 1 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result;
Mode 2: determining 2 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result;
...
Mode 5: determining 5 times the minimum granularity of the first dimension of the input data as the target size of the part of the input data, using the part of the input data as the initial data, and blocking the constant data in one dimension based on the determined first dimension of the constant data, to obtain a one-dimensional blocking result.
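The incremental determination of the maximum value of i described above can be sketched as follows. This is an illustrative sketch only: `allocate_ok`, `make_check`, and all capacity figures are hypothetical stand-ins for the allocate attempt on the computing device, not the disclosed implementation.

```python
def max_i(min_granularity, allocate_ok):
    # Increment i from 1; at each step the target size of the partial
    # input data is i times the minimum granularity of its first
    # dimension. Stop at the first failed allocate attempt and return
    # the last successful i (0 means even i=1 failed, i.e. the scheme
    # is unavailable).
    i = 1
    while allocate_ok(i * min_granularity):
        i += 1
    return i - 1

def make_check(device_mem, const_block_cap, row_bytes):
    # Hypothetical allocate attempt: the initial data (the partial
    # input, target_size rows of row_bytes each) plus a ping-pong pair
    # of constant data blocks must fit in device memory.
    def allocate_ok(target_size):
        return target_size * row_bytes + 2 * const_block_cap <= device_mem
    return allocate_ok

check = make_check(device_mem=4096, const_block_cap=128, row_bytes=32)
print(max_i(32, check))  # → 3: i=4 would need 4*32*32 + 256 > 4096
```

With a maximum of 3, scheme five would contribute three candidate blocking strategies (target sizes of 1, 2, and 3 times the minimum granularity).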
Part of the constant data may also be used as the initial data, and the input data blocked in one dimension based on the determined first dimension of the input data.
In an optional implementation, in scheme seven or scheme eight, using part of the constant data as the initial data and blocking the input data in the specified dimension based on the determined dimension parameter of the input data to obtain a blocking result includes:
1. Determining a target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data;
2. Using the part of the constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameter of the input data, to obtain a blocking result.
Here, the maximum value of j can be determined incrementally. The following takes scheme seven as an example. j is incremented starting from 1: when j=1, the target size of the part of the constant data is 1 times the minimum granularity of the first dimension of the constant data; the part of the constant data of that target size is used as the initial data, and the input data is blocked in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result.
When the one-dimensional blocking result corresponding to j=1 indicates that the allocate of the input data failed, scheme seven is unavailable. When the one-dimensional blocking result corresponding to j=1 indicates that the allocate of the input data succeeded, j is incremented by 1 (giving j=2), and the process returns to the step of determining the target size of the part of the constant data, until a one-dimensional blocking result indicates that the allocate of the input data failed. For example, if the one-dimensional blocking result at j=6 indicates that the allocate of the input data failed, the maximum value of j is determined to be 5. When the maximum value of j is 5, this scheme can yield 5 blocking results.
A blocking result indicating that the allocate of the input data failed may mean that, after the input data is divided according to the minimum granularity of the first dimension, the resulting input data blocks and the initial data do not meet the memory requirements of the computing device. If the scheduling policy is ping-pong scheduling, the allocate of the input data fails when twice the data capacity of an input data block divided according to the minimum granularity of the first dimension exceeds the memory of the scheduling area of the computing device. For example, if the initial data, the scheduling-data ping (an input data block divided according to the minimum granularity of the first dimension), and the scheduling-data pong (another such input data block) do not meet the memory requirements of the computing device, it is determined that the allocate of the input data failed.
For example, when the maximum value of j is 6, scheme seven may include the following 6 blocking strategies:
Mode 1: determining 1 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result;
Mode 2: determining 2 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result;
...
Mode 6: determining 6 times the minimum granularity of the first dimension of the constant data as the target size of the part of the constant data, using the part of the constant data as the initial data, and blocking the input data in one dimension based on the determined first dimension of the input data, to obtain a one-dimensional blocking result.
Here, the first dimension and second dimension along which the input data is blocked can be set according to information such as operating requirements and/or the operator type, and likewise for the first dimension and second dimension along which the constant data is blocked. For example, if the operator is a convolution operator, the first dimension of the constant data may be the output channel (OC) dimension, and the second dimension may be the input channel (IC) dimension.
Here, providing multiple blocking strategies enables each network layer to be processed to select a better target operator and a target blocking strategy matching that target operator.
In an optional implementation, when the specified dimension is one dimension and the dimension parameter includes the first dimension, using the constant data and the input data respectively as target data and blocking the target data in one dimension based on the determined first dimension of the target data to obtain a one-dimensional blocking result includes:
A1: determining k times the minimum granularity corresponding to the first dimension of the target data as the target block size, and, based on the target block size, blocking the target data in one dimension along the first dimension to obtain multiple target data blocks corresponding to the target data, where k is a positive integer;
A2: when it is determined that the multiple target data blocks and the initial data meet the set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as the updated target block size and returning to the step of blocking the target data in one dimension along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data (the last value of k that satisfied the condition) as the blocking result;
A3: when the initial data and the multiple target data blocks generated when k equals 1 do not meet the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
With the above method, the target block size is continuously increased, and the blocking result that gives the computing device a higher memory utilization is determined by repeated attempts, which helps reduce the waste of the computing device's memory resources.
In step A1, k is a positive integer. Starting from k=1, the minimum granularity corresponding to the first dimension of the target data is determined as the target block size, and the target data is blocked in one dimension along the first dimension according to the target block size, to obtain multiple target data blocks corresponding to the target data. The first-dimension size of each resulting target data block equals the target block size, and the size of each target data block in every dimension other than the first equals the size of the corresponding dimension of the target data.
For example, if the minimum granularity of the first dimension is 32 and the size of the target data is 64×64×128, the target block size is 32; blocking the target data in one dimension along the first dimension according to the target block size yields multiple target data blocks, each of size 32×64×128. The number of target data blocks can be determined according to the actual situation.
The first dimension can be set as needed. For example, the first dimension of the input data may be the width (W) dimension and its second dimension the input channel (IC) dimension; the first dimension of the constant data may be the output channel (OC) dimension and its second dimension the input channel (IC) dimension.
It can then be judged whether the multiple target data blocks and the initial data meet the set blocking condition. If so, twice the minimum granularity corresponding to the first dimension of the target data is taken as the updated target block size, and the process returns to the step of blocking the target data in one dimension along the first dimension according to the target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and k times the minimum granularity corresponding to the first dimension of the target data is determined as the blocking result. For example, if at k=5 it is determined that the multiple target data blocks generated and the initial data do not meet the set blocking condition, 4 times the minimum granularity corresponding to the first dimension of the target data is determined as the blocking result. That is, when running the network layer to be processed, 4 times the minimum granularity of the first dimension can be used as the target block size, and the target data corresponding to the target operator of that layer is blocked in one dimension according to the target block size.
If the condition is not met (i.e., the multiple target data blocks generated when k=1 and the initial data do not meet the set blocking condition), the blocking result is determined to be a one-dimensional blocking failure.
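Steps A1 to A3 above can be sketched as the following search loop. This is illustrative only: `fits` is a hypothetical stand-in for the set blocking condition, and the guard that stops the block size at the dimension size is an added assumption.

```python
def one_dim_blocking(first_dim_size, min_granularity, fits):
    # A1: start with a target block size of 1x the minimum granularity.
    # A2: while the blocks satisfy the blocking condition, grow the
    #     block size by one more multiple of the granularity.
    # A3: return None if even k=1 fails (one-dimensional blocking
    #     fails); otherwise return the last size that satisfied it.
    k, best = 1, None
    while True:
        block = k * min_granularity
        if block > first_dim_size or not fits(block):
            return best
        best = block
        k += 1

def fits(block):
    # Hypothetical condition: one (block x 64 x 128) target data block
    # plus fixed initial data must fit in a 700 KB budget.
    return block * 64 * 128 + 100_000 <= 700_000

print(one_dim_blocking(64, 32, fits))  # → 64
```

Here the search stops at 64 because 96 would exceed the 64-wide first dimension; with a tighter budget it would instead stop at the first failed `fits` check, mirroring the allocate failure described above.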
In an optional implementation, when the specified dimension is two dimensions and the dimension parameter includes the second dimension, using the constant data and the input data respectively as target data and blocking the target data in two dimensions based on the determined first dimension and second dimension of the target data to obtain a blocking result includes:
B1: determining y times the minimum granularity corresponding to the first dimension of the target data as the first target block size, and, based on the first target block size, blocking the target data in one dimension along the first dimension to obtain multiple intermediate data blocks corresponding to the target data, where y is a positive integer;
B2: determining x times the minimum granularity corresponding to the second dimension of the target data as the second target block size, and, based on the second target block size, blocking each intermediate data block along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, where x is a positive integer;
B3: when it is determined that the multiple target data blocks and the initial data meet the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as the updated second target block size and returning to the step of blocking each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
In B1, y is a positive integer with an initial value of 1. For example, when the maximum value of y is set to 3, y can be set to 1 and steps B1 to B3 executed to obtain one two-dimensional blocking result; then y set to 2 and steps B1 to B3 executed to obtain another; and y set to 3 and steps B1 to B3 executed to obtain a third, so that 3 two-dimensional blocking results can be obtained.
Taking y=1 as an example of the two-dimensional blocking process: if the minimum granularity corresponding to the first dimension is 32 and the size of the target data is 128×128×256, the target data can be blocked in one dimension along the first dimension based on the first target block size, yielding multiple intermediate data blocks corresponding to the target data, each of size 32×128×256. The number of intermediate data blocks can be determined according to the actual situation.
In B2, continuing the example in B1: x is a positive integer. Starting from x=1, 1 times the minimum granularity corresponding to the second dimension of the target data is determined as the second target block size. For example, if the minimum granularity of the second dimension is 32, the second target block size is 32; based on the second target block size, each intermediate data block is blocked along the second dimension to obtain multiple target data blocks corresponding to each intermediate data block, each target data block being of size 32×32×256.
In B3, it can be judged whether the multiple target data blocks and the initial data meet the set blocking condition. If so, 2 (i.e., x+1) times the minimum granularity corresponding to the second dimension of the target data is taken as the updated second target block size, and the process returns to the step of blocking each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data do not meet the set blocking condition, and x times the minimum granularity corresponding to the second dimension of the target data is determined as the blocking result.
For example, if at x=3 it is determined that the multiple target data blocks generated and the initial data do not meet the set blocking condition, 2 times the minimum granularity corresponding to the second dimension of the target data is determined as the blocking result. That is, when running the network layer to be processed, the minimum granularity of the first dimension can be used as the first target block size and twice the minimum granularity of the second dimension as the second target block size, and the target data corresponding to the target operator of that layer is blocked in two dimensions based on the first target block size and the second target block size.
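For a fixed y, steps B1 to B3 can be sketched as follows (illustrative only; `fits` is a hypothetical blocking condition, not the disclosed one):

```python
def two_dim_blocking(min_gran_first, min_gran_second, y, fits):
    # B1: the first target block size is y times the first dimension's
    #     minimum granularity.
    # B2/B3: grow x from 1; the second target block size is x times
    #     the second dimension's minimum granularity. Return the last
    #     pair that satisfied the blocking condition (the second size
    #     is None if even x=1 fails).
    first = y * min_gran_first
    x, best = 1, None
    while fits(first, x * min_gran_second):
        best = x * min_gran_second
        x += 1
    return first, best

def fits(first, second):
    # Hypothetical condition: one (first x second x 256) target data
    # block plus initial data must fit in 1 MiB.
    return first * second * 256 + 200_000 <= 1_048_576

print(two_dim_blocking(32, 32, 1, fits))  # → (32, 96)
```

Running this once for each y from 1 up to its maximum would yield the several two-dimensional blocking results of the example above.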
一种可选实施方式中,在待处理网络层对应的参数数据还包括输出数据的情况下,确定多个目标数据块和初始数据满足设置的分块条件,包括:在确定初始数据、输出数据、和每个目标数据块分别满足计算设备的内存要求,以及初始数据、输出数据、和每个目标数据块分别满足计算设备中DMA传输要求的情况下,确定多个目标数据块和初始数据满足设置的分块条件。In an optional embodiment, when the parameter data corresponding to the network layer to be processed also includes output data, determining that multiple target data blocks and initial data meet the set block conditions, including: determining initial data, output data , and each target data block respectively meet the memory requirements of the computing device, and when the initial data, output data, and each target data block respectively meet the DMA transfer requirements in the computing device, determine that multiple target data blocks and initial data satisfy The set block condition.
这里,计算设备的内存要求可以根据用户需求和/或计算设备需求进行设置。比如,可以确定初始数据、输出数据、和每个目标数据块的数据容量总和,是否小于或等于设置的计算设备的内存容量,若是,则确定满足计算设备的内存要求。Here, the memory requirements of the computing device may be set according to user requirements and/or computing device requirements. For example, it can be determined whether the total data capacity of the initial data, output data, and each target data block is less than or equal to the set memory capacity of the computing device, and if so, it is determined to meet the memory requirements of the computing device.
或者,还可以确定初始数据的数据容量是否小于或等于在计算设备的内存上为初始数据分配的第一局部内存容量,确定输出数据的数据容量是否小于或等于在计算设备的内存上为输出数据分配的第二局部内存容量,以及确定每个目标数据块的数据容量是否小于或等于在计算设备的内存上为目标数据分配的三局部内存容量,若初始数据、输出数据和每个目标数据块均满足要求,则确定满足计算设备的内存要求。Alternatively, it is also possible to determine whether the data capacity of the initial data is less than or equal to the first local memory capacity allocated for the initial data on the memory of the computing device, and determine whether the data capacity of the output data is less than or equal to the output data on the memory of the computing device. The allocated second local memory size, and determining whether the data size of each target data block is less than or equal to the three local memory sizes allocated for the target data on the memory of the computing device, if the initial data, output data and each target data block If all requirements are met, it is determined that the memory requirements of the computing device are met.
在具体实施时,还可以设置专用内存和公共内存,若设置常数数据存储在公共内存中,输入数据和输出数据存储在专用内存上,则可以判断初始数据、输出数据、和每个目标数据块是否均满足对应的专用内存和公共内存的内存要求,若是,则确定满足计算设备的内存要求。即在初始数据为输入数据,目标数据块为常数数据对应的目标数据块,则判断初始数据和输出数据的数据容量是否小于或等于设置的专用内存的内存容量,以及判断每个目标数据块是否小于或等于设置的公共内存的内存容量,若均满足,则确定满足计算设备的内存要求。In specific implementation, special memory and public memory can also be set. If the constant data is set to be stored in the public memory, and the input data and output data are stored in the special memory, the initial data, output data, and each target data block can be determined. Whether both of them meet the memory requirements of the corresponding dedicated memory and public memory, and if so, determine that the memory requirements of the computing device are met. That is, when the initial data is the input data and the target data block is the target data block corresponding to the constant data, then judge whether the data capacity of the initial data and output data is less than or equal to the set memory capacity of the dedicated memory, and judge whether each target data block is It is less than or equal to the set memory capacity of the public memory. If all are satisfied, it is determined that the memory requirements of the computing device are satisfied.
Exemplarily, after each target data block has been determined, an allocate attempt can be made for the target data blocks, the initial data, and the output data; if the attempt succeeds, the initial data, the output data, and each target data block are determined to meet the memory requirements of the computing device.
The DMA transfer requirements can be determined according to actual needs. For example, if the sum of the data capacities of the initial data, the output data, and each target data block is less than or equal to the data capacity transferable by DMA, i.e., when it is determined that the DMA task is successfully established, the DMA transfer requirements of the computing device are determined to be met.
When it is determined that the initial data, the output data, and each target data block meet the memory requirements of the computing device and the DMA transfer requirements of the computing device, it is determined that the multiple target data blocks and the initial data meet the set blocking condition.
With the above method, the multiple target data blocks and the initial data are determined to meet the set blocking condition only when the initial data, the output data, and each target data block meet both the memory requirements and the DMA transfer requirements of the computing device, ensuring that the blocking strategy matches the operating requirements of the computing device.
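The combined check can be sketched as follows. This sketch uses the simple sum-based variants of the memory and DMA limits described above; the per-region and dedicated/shared-memory variants are alternatives, and the capacities shown are hypothetical.

```python
def meets_blocking_condition(initial_cap, output_cap, block_caps,
                             device_mem, dma_max):
    # The initial data, the output data, and every target data block
    # must together fit in the computing device's memory, and the same
    # total must not exceed the capacity the DMA can transfer.
    total = initial_cap + output_cap + sum(block_caps)
    return total <= device_mem and total <= dma_max

print(meets_blocking_condition(100, 50, [200, 200], 1000, 800))  # True
print(meets_blocking_condition(100, 50, [400, 400], 1000, 800))  # False (DMA limit)
```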
Regarding S103: after the target operator and target blocking strategy corresponding to each network layer to be processed of the target neural network have been determined, the target neural network containing the target operators can be run based on the target blocking strategies respectively corresponding to the at least one network layer to be processed.
For example, an image to be processed can be input into the target neural network; the computing device performs feature extraction on the image using the target blocking strategy and target operator respectively corresponding to each network layer to be processed, and determines a detection result corresponding to the image. The detection result may be, for example, the category of a target object included in the image, the position information of the target object, the contour information of the target object, and the like.
Exemplarily, FIG. 4 is a schematic diagram of software and hardware scheduling of the computing device in a neural network operation method. The process of using ping-pong scheduling to process the parameter data of a network layer to be processed is described with reference to FIG. 4. The memory of the computing device may be divided into an initial data area, a scheduling data area ping, a scheduling data area pong, an output data area ping, and an output data area pong. When the initial data is the input data, the scheduling data is the constant data; when the initial data is the constant data, the scheduling data is the input data.
As can be seen from FIG. 4, the computing device and the DMA run in parallel. The DMA first transmits the initial data and the scheduling ping (that is, the scheduling data ping) to the corresponding memory areas of the computing device (that is, the initial data is transmitted to the memory area corresponding to the initial data area of the computing device, and the scheduling data ping is transmitted to the memory area corresponding to the scheduling data area ping of the computing device). The computing device then processes the initial data and the scheduling ping; at the same time, the DMA may transmit the scheduling pong (that is, the scheduling data pong) to the memory area corresponding to the scheduling data pong of the computing device.
After the computing device finishes processing the initial data and the scheduling ping, an output ping (that is, the output data ping) is generated and placed in the memory area corresponding to the output data area ping of the computing device; the DMA fetches the output ping from that memory area and transmits it to the corresponding external memory (for example, a DDR). The computing device then processes the received scheduling pong, while the DMA transmits the next scheduling ping to the memory area corresponding to the scheduling ping of the computing device. The above process is repeated until the parameter data of the layer to be processed has been completely processed.
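The ping-pong schedule above can be sketched as a double-buffering loop: while the device computes on one scheduling buffer, the DMA fills the other. The following minimal Python sketch is illustrative only — the function names and the sequential stand-ins for the DMA and device operations are assumptions, not part of the disclosure (a real schedule runs the loads concurrently with the compute step):

```python
def run_layer(initial_data, schedule_blocks, dma_load, compute, dma_store):
    """Ping-pong (double-buffer) scheduling sketch.

    schedule_blocks: parameter-data blocks streamed in one by one.
    dma_load/compute/dma_store stand in for DMA transfers and device work.
    """
    buffers = [None, None]                       # buffer 0 = ping, 1 = pong
    outputs = []
    buffers[0] = dma_load(schedule_blocks[0])    # prefetch the first block
    for i in range(len(schedule_blocks)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(schedule_blocks):
            # overlapped in hardware: fill the other buffer while computing
            buffers[nxt] = dma_load(schedule_blocks[i + 1])
        out = compute(initial_data, buffers[cur])  # device processes one block
        outputs.append(dma_store(out))             # DMA writes the output back
    return outputs
```

With identity transfers and an additive compute step, `run_layer(10, [1, 2, 3], ...)` yields one output per scheduled block, the next block having been staged while the previous one was being processed.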
Those skilled in the art can understand that, in the above methods of the specific implementations, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same concept, an embodiment of the present disclosure further provides a neural network operation apparatus. FIG. 5 is a schematic architectural diagram of the neural network operation apparatus provided by the embodiment of the present disclosure, which includes a first determining module 501, a second determining module 502, and an operation module 503. Specifically:
The first determining module 501 is configured to determine a network layer to be processed in a target neural network.
The second determining module 502 is configured to determine, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network.
The operation module 503 is configured to run the target neural network including the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In a possible implementation, the blocking strategy is used to block the parameter data of the target operator corresponding to the network layer to be processed;
among the multiple blocking strategies, the target blocking strategy is the strategy for which running the network layer to be processed, based on the parameter data obtained by blocking the parameter data of the target operator with that strategy, consumes the fewest resources.
In a possible implementation, in a case where there are multiple network layers to be processed, when determining, from the multiple determined operators and multiple blocking strategies, the target operators and target blocking strategies corresponding to the network layers to be processed in the target neural network, the second determining module 502 is configured to:
for each network layer to be processed in the target neural network, determine a target candidate operator corresponding to the network layer to be processed from the multiple operators, and determine, from the multiple blocking strategies, a target candidate blocking strategy matching the target candidate operator;
in a case where any network layer to be processed corresponds to multiple target candidate operators and/or multiple target candidate blocking strategies, determine the target operator and the target blocking strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate blocking strategies respectively corresponding to the network layers to be processed.
In a possible implementation, when determining the target operator and the target blocking strategy corresponding to each network layer to be processed based on the target candidate operators and target candidate blocking strategies respectively corresponding to the network layers to be processed, the second determining module 502 is configured to:
determine multiple test networks corresponding to the target neural network based on the target candidate operators respectively corresponding to the network layers to be processed and the target candidate blocking strategies corresponding to those operators; wherein each test network includes, for each network layer to be processed, one target candidate operator and one target candidate blocking strategy matching that operator;
run the multiple test networks respectively to obtain multiple test results, wherein each test network corresponds to one test result;
select a target test network from the multiple test networks based on the multiple test results;
determine the target candidate operators and target candidate blocking strategies of the network layers to be processed in the target test network as the target operators and target blocking strategies respectively corresponding to the network layers to be processed in the target neural network.
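The selection described in this implementation amounts to a search over per-layer (operator, blocking-strategy) combinations: build every test network, run each one, and keep the best. A minimal sketch under that reading (the cost-returning `run_test` callback and the tuple-based data layout are illustrative assumptions, not part of the disclosure):

```python
from itertools import product

def select_targets(layer_candidates, run_test):
    """layer_candidates: for each layer, a list of (operator, strategy) pairs.
    run_test: runs one full test network and returns its measured cost.
    Returns the per-layer choices of the best-scoring test network."""
    test_networks = list(product(*layer_candidates))  # one combination each
    results = [(run_test(net), net) for net in test_networks]
    best_cost, best_net = min(results, key=lambda r: r[0])
    return list(best_net)
```

Because the combinations multiply across layers, a practical implementation would first prune each layer's candidates (as the cost-model steps below do) before running this search.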
In a possible implementation, when, for each network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the multiple operators and determining, from the multiple blocking strategies, a target candidate blocking strategy matching the target candidate operator, the second determining module 502 is configured to:
for the network layer to be processed, determine one or more first candidate operators from the multiple operators;
based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, select, from the first candidate operators and the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
In a possible implementation, the resource consumption is represented by a computational cost value, and the second determining module 502 is configured to determine the computational cost value of a first candidate operator under each blocking strategy according to the following steps:
determining a restricted scenario corresponding to the first candidate operator under a preset size, wherein the restricted scenario is determined based on the computation time and the transfer time of the data volume corresponding to the first candidate operator under the preset size;
in a case where the restricted scenario is a bandwidth-limited scenario, determining, based on the blocking result of the blocking strategy, the total direct memory access (DMA) data transfer amount, the number of DMA tasks, and the data conversion overhead corresponding to the first candidate operator under the blocking strategy; and determining the computational cost value of the first candidate operator under the blocking strategy based on the total DMA data transfer amount, the number of DMA tasks, the data conversion overhead, and the DMA rate and DMA task overhead corresponding to the computing device; wherein the data conversion overhead is the time consumed in converting the data layout of the input data corresponding to the first candidate operator into the target data layout corresponding to the first candidate operator;
in a case where the restricted scenario is a compute-limited scenario, determining, based on the blocking result of the blocking strategy, the computation time of the parameter data corresponding to the first candidate operator under the blocking strategy, the number of operator calls of the first candidate operator, the total initial data transfer amount, the number of DMA tasks, and the data conversion overhead; and determining the computational cost value of the first candidate operator under the blocking strategy based on the computation time, the number of operator calls, the total initial data transfer amount, the data conversion overhead, the DMA task overhead, the number of DMA tasks, and the DMA rate corresponding to the computing device.
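The text names the quantities that enter the cost value in each restricted scenario but does not spell out the exact arithmetic; the sketch below is therefore a hedged reconstruction — one plausible way of combining those quantities, not the formula of the disclosure:

```python
def bandwidth_limited_cost(dma_total_bytes, dma_task_count,
                           convert_overhead, dma_rate, dma_task_overhead):
    # Transfers dominate: total bytes over the DMA rate, plus per-task
    # setup overhead and the data-layout conversion time.
    return (dma_total_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)

def compute_limited_cost(compute_time, call_count, initial_total_bytes,
                         dma_task_count, convert_overhead,
                         dma_rate, dma_task_overhead):
    # Computation dominates: per-call compute time times the number of
    # operator calls, plus the initial transfer and conversion costs
    # that cannot be hidden behind computation.
    return (compute_time * call_count
            + initial_total_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)
```

Either way, the cost value lets candidate (operator, blocking-strategy) pairs be ranked without running every one of them on the device.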
In a possible implementation, when selecting, based on the resource consumption of the first candidate operators under each of the multiple blocking strategies, one or more target candidate operators corresponding to the network layer to be processed and one or more target candidate blocking strategies corresponding to the target candidate operators from the first candidate operators and the multiple blocking strategies, the second determining module 502 is configured to:
select, from the multiple resource consumption values corresponding to the first candidate operators, a target resource consumption that satisfies a preset condition; wherein one first candidate operator corresponds to one resource consumption value under one blocking strategy;
determine the blocking strategy corresponding to the target resource consumption as a candidate blocking strategy, and, based on the candidate blocking strategy, run the network layer to be processed including the second candidate operator corresponding to the target resource consumption, to determine a test result corresponding to the candidate blocking strategy and the second candidate operator;
based on the test result, determine the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators.
In a possible implementation, before selecting, from the first candidate operators and the multiple blocking strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate blocking strategies corresponding to the target candidate operators, the apparatus further includes:
an alignment module 504, configured to perform an alignment operation on the parameter data corresponding to the first candidate operators based on the determined minimum granularity information corresponding to the target neural network, to obtain aligned parameter data corresponding to the first candidate operators;
wherein the minimum granularity information includes the minimum granularity corresponding to the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity of the corresponding dimension indicated by the minimum granularity information.
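The alignment operation amounts to rounding each dimension of the parameter data up to the nearest integer multiple of that dimension's minimum granularity. A minimal shape-only sketch (how the padded elements are filled is left open here):

```python
def align_shape(shape, min_granularity):
    """Round each dimension size up to a multiple of its minimum granularity."""
    aligned = []
    for size, grain in zip(shape, min_granularity):
        aligned.append(((size + grain - 1) // grain) * grain)  # ceil to multiple
    return aligned
```

For example, a (5, 17) parameter tensor with per-dimension granularities (4, 8) is padded to (8, 24) before any blocking strategy is evaluated.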
In a possible implementation, in a case where the parameter data includes input data and constant data, the multiple blocking strategies include at least one of the following:
taking all the input data as initial data, and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data, to obtain a blocking result; wherein the initial data is the data written into the initial data area allocated to a direct memory access (DMA) task when the computing device runs the target neural network;
taking all the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameters of the input data, to obtain a blocking result;
taking part of the input data as the initial data, and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data, to obtain a blocking result; wherein the target size of the part of the input data is determined according to the minimum granularity of the first dimension of the input data;
taking part of the constant data as the initial data, and blocking the input data in a specified dimension based on the determined dimension parameters of the input data, to obtain a blocking result; wherein the target size of the part of the constant data is determined according to the minimum granularity of the first dimension of the constant data.
In a possible implementation, taking part of the input data as the initial data and blocking the constant data in a specified dimension based on the determined dimension parameters of the constant data to obtain a blocking result includes:
determining the target size of the part of the input data based on i times the minimum granularity of the first dimension of the input data;
taking the part of the input data of the target size as the initial data, and blocking the constant data in the specified dimension based on the determined dimension parameters of the constant data, to obtain the blocking result;
wherein i is a positive integer such that, after the target size of the part of the input data is determined, the data volume of the part of the input data, together with the data volume of the constant data blocks determined based on the minimum granularity of the dimension parameters of the constant data, satisfies the memory requirement of the computing device.
In a possible implementation, taking part of the constant data as the initial data and blocking the input data in a specified dimension based on the determined dimension parameters of the input data to obtain a blocking result includes:
determining the target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data;
taking the part of the constant data of the target size as the initial data, and blocking the input data in the specified dimension based on the determined dimension parameters of the input data, to obtain the blocking result;
wherein j is a positive integer such that, after the target size of the part of the constant data is determined, the data volume of the part of the constant data, together with the data volume of the input data blocks determined based on the minimum granularity of the dimension parameters of the input data, satisfies the memory requirement of the computing device.
In a possible implementation, in a case where the specified dimension is one dimension and the dimension parameters include the first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain a one-dimensional blocking result includes:
determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size, and performing one-dimensional blocking on the target data along the first dimension based on the target block size, to obtain multiple target data blocks corresponding to the target data; wherein k is a positive integer;
in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking k+1 times the minimum granularity corresponding to the first dimension of the target data as the updated target block size and returning to the step of performing one-dimensional blocking on the target data along the first dimension based on the target block size, until it is determined that the multiple target data blocks and the initial data no longer satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result;
in a case where the initial data and the multiple target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining that the blocking result is a one-dimensional blocking failure.
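The one-dimensional blocking loop above grows the block size one granularity step at a time and keeps the largest k whose blocks still satisfy the set condition, failing if even k = 1 does not. A sketch with the blocking condition abstracted into a caller-supplied predicate (`fits` is an assumption standing in for the memory and DMA checks of the disclosure):

```python
def one_d_block(total_size, grain, fits):
    """Return the largest block size k*grain (k >= 1) satisfying `fits`,
    or None if even k == 1 fails (one-dimensional blocking failure)."""
    k, best = 1, None
    while k * grain <= total_size and fits(k * grain):
        best = k * grain        # this k works; try k + 1 next
        k += 1
    return best
```

For instance, blocking a 100-element first dimension at granularity 8 under a 30-element block budget settles on blocks of 24 elements (k = 3).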
In a possible implementation, in a case where the specified dimension is two dimensions and the dimension parameters include a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain a two-dimensional blocking result includes:
determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size, and performing one-dimensional blocking on the target data along the first dimension based on the first target block size, to obtain multiple intermediate data blocks corresponding to the target data; wherein y is a positive integer;
determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size, and performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, to obtain multiple target data blocks respectively corresponding to the intermediate data blocks; wherein x is a positive integer;
in a case where it is determined that the multiple target data blocks and the initial data satisfy the set blocking condition, taking x+1 times the minimum granularity corresponding to the second dimension of the target data as the updated second target block size and returning to the step of performing two-dimensional blocking on each intermediate data block along the second dimension based on the second target block size, until it is determined that the multiple target data blocks and the initial data no longer satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
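Two-dimensional blocking nests the same search: with the first-dimension block fixed at y granularity steps, the second-dimension multiplier x is grown until the condition fails. A sketch under the same abstraction as the one-dimensional case (`fits` again stands in for the memory and DMA checks):

```python
def two_d_block(y, grains, fits):
    """grains = (first-dim granularity, second-dim granularity).
    Fix the first-dimension block at y*grains[0]; return the largest
    (height, width) block satisfying `fits`, or None on failure."""
    height = y * grains[0]
    x, best = 1, None
    while fits((height, x * grains[1])):
        best = (height, x * grains[1])   # this x works; try x + 1 next
        x += 1
    return best
```

With granularities (4, 8), y = 2, and a 200-element block budget, the search settles on 8x24 blocks before 8x32 exceeds the budget.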
In a possible implementation, in a case where the parameter data corresponding to the network layer to be processed further includes output data, determining that the multiple target data blocks and the initial data satisfy the set blocking condition includes:
in a case where it is determined that the initial data, the output data, and each target data block respectively satisfy the memory requirement of the computing device, and that the initial data, the output data, and each target data block respectively satisfy the DMA transfer requirement of the computing device, determining that the multiple target data blocks and the initial data satisfy the set blocking condition.
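The blocking condition of this implementation checks each piece of data independently against both constraints. A minimal sketch (the two predicate parameters are assumptions abstracting the device's memory requirement and DMA transfer requirement):

```python
def blocks_satisfy(initial_bytes, output_bytes, block_bytes, fits_memory, fits_dma):
    """True iff the initial data, the output data, and every target data
    block each satisfy both the memory and the DMA transfer requirement."""
    pieces = [initial_bytes, output_bytes, *block_bytes]
    return all(fits_memory(p) and fits_dma(p) for p in pieces)
```

A predicate of this shape is what the one- and two-dimensional blocking searches above would call on every candidate block size.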
In some embodiments, the functions of, or the modules included in, the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. FIG. 6 is a schematic structural diagram of the electronic device provided by the embodiment of the present disclosure, which includes a processor 601, a memory 602, and a bus 603. The memory 602 is configured to store execution instructions and includes an internal memory 6021 and an external memory 6022; the internal memory 6021 is used to temporarily store operation data in the processor 601 and data exchanged with the external memory 6022 such as a hard disk. The processor 601 exchanges data with the external memory 6022 through the internal memory 6021. When the electronic device 600 runs, the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
determining a network layer to be processed in a target neural network;
determining, from multiple determined operators and multiple blocking strategies, a target operator and a target blocking strategy corresponding to the network layer to be processed in the target neural network; wherein each of the multiple operators is used to implement the function corresponding to the network layer to be processed, and each of the multiple blocking strategies matches the operating requirements of the computing device used to run the target neural network;
running the target neural network including the target operator based on the target blocking strategy corresponding to the network layer to be processed.
In addition, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program; when the computer program is run by a processor, the steps of the neural network operation method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the neural network operation method described in the above method embodiments. For details, reference may be made to the above method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such an understanding, the technical solutions of the present disclosure essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed in the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (18)

  1. A neural network operation method, comprising:
    determining a network layer to be processed in a target neural network;
    determining, from a plurality of determined operators and a plurality of block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein the plurality of operators are used to implement a function corresponding to the network layer to be processed, and the plurality of block strategies match operation requirements of a computing device used to run the target neural network; and
    running the target neural network comprising the target operator based on the target block strategy corresponding to the network layer to be processed.
  2. The method according to claim 1, wherein the block strategy is used to block parameter data of the target operator corresponding to the network layer to be processed; and
    among the plurality of block strategies, the target block strategy is the one under which, based on the parameter data obtained by blocking the parameter data of the target operator, running the network layer to be processed consumes the fewest resources.
  3. The method according to claim 1 or 2, wherein, in a case where there are a plurality of network layers to be processed, determining, from the plurality of determined operators and the plurality of block strategies, the target operator and the target block strategy corresponding to each network layer to be processed in the target neural network comprises:
    for any network layer to be processed in the target neural network, determining a target candidate operator corresponding to the network layer to be processed from the plurality of operators, and determining, from the plurality of block strategies, a target candidate block strategy matching the target candidate operator; and
    in a case where any network layer to be processed corresponds to a plurality of target candidate operators and/or a plurality of target candidate block strategies, determining the target operator and the target block strategy corresponding to the network layer to be processed based on the target candidate operators and the target candidate block strategies corresponding to the network layer to be processed.
  4. The method according to claim 3, wherein determining the target operator and the target block strategy corresponding to the network layer to be processed based on the target candidate operators and the target candidate block strategies corresponding to the network layer to be processed comprises:
    determining a plurality of test networks corresponding to the target neural network based on the target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to those target candidate operators, wherein any one of the plurality of test networks comprises one target candidate operator corresponding to the network layer to be processed and one target candidate block strategy matching that target candidate operator;
    running the plurality of test networks respectively to obtain a plurality of test results, wherein each of the plurality of test networks corresponds to one test result;
    selecting a target test network from the plurality of test networks based on the plurality of test results; and
    determining the target candidate operator and the target candidate block strategy of the network layer to be processed in the target test network as the target operator and the target block strategy corresponding to the network layer to be processed in the target neural network.
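The test-network selection above can be sketched as an exhaustive search over per-layer candidate pairs, scoring each full combination and keeping the best. This is an illustrative reading only, not the claimed implementation; the names `select_targets` and `run_test_network`, and the use of latency as the test result, are assumptions.

```python
# Illustrative sketch: enumerate candidate (operator, block_strategy) pairs per
# layer, run one test network per combination, and keep the combination with
# the best (here: lowest-latency) test result.
from itertools import product

def select_targets(layers, candidates, run_test_network):
    """candidates: layer name -> list of (operator, block_strategy) pairs.
    run_test_network: callable taking {layer: pair} and returning a latency."""
    combos = list(product(*(candidates[layer] for layer in layers)))
    # Each combination defines one test network; its test result is a latency.
    results = [(run_test_network(dict(zip(layers, c))), c) for c in combos]
    _, best_combo = min(results, key=lambda r: r[0])
    return dict(zip(layers, best_combo))
```

With several layers the combination count grows multiplicatively, which is why claims 5-7 prune candidates by estimated resource consumption before any test network is actually run.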
  5. The method according to claim 3 or 4, wherein, for any network layer to be processed in the target neural network, determining the target candidate operator corresponding to the network layer to be processed from the plurality of operators and determining the target candidate block strategy matching the target candidate operator from the plurality of block strategies comprises:
    determining, for the network layer to be processed, one or more first candidate operators from the plurality of operators; and
    selecting, based on resource consumption of each first candidate operator under each of the plurality of block strategies, one or more target candidate operators corresponding to the network layer to be processed and target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies.
  6. The method according to claim 5, wherein the resource consumption is represented by a computation cost value, and the computation cost value of a first candidate operator under each block strategy is determined according to the following steps:
    determining a restricted scenario corresponding to the first candidate operator at a preset size, wherein the restricted scenario is determined based on the computation time and the transfer time of the data volume corresponding to the first candidate operator at the preset size;
    in a case where the restricted scenario is a bandwidth-restricted scenario, determining, based on a blocking result obtained by blocking according to the block strategy, a total amount of direct memory access (DMA) data transfer, a number of DMA tasks, and a data conversion overhead corresponding to the first candidate operator under the block strategy; and determining the computation cost value of the first candidate operator under the block strategy based on the total amount of DMA data transfer, the number of DMA tasks, the data conversion overhead, and the DMA rate and per-task DMA overhead of the computing device, wherein the data conversion overhead is the time consumed to convert the data layout of the input data corresponding to the first candidate operator into the target data layout corresponding to the first candidate operator; and
    in a case where the restricted scenario is a computation-restricted scenario, determining, based on the blocking result obtained by blocking according to the block strategy, the computation time of the parameter data corresponding to the first candidate operator under the block strategy, the number of operator invocations of the first candidate operator, a total amount of initial data transfer, the number of DMA tasks, and the data conversion overhead; and determining the computation cost value of the first candidate operator under the block strategy based on the computation time, the number of operator invocations, the total amount of initial data transfer, the data conversion overhead, the per-task DMA overhead, the number of DMA tasks, and the DMA rate of the computing device.
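Claim 6 lists the inputs of the two cost formulas but not their exact combination. The sketch below is one plausible reading (transfer time = bytes / rate, plus per-task and conversion overheads; compute cost adds per-invocation compute time); the function and parameter names are hypothetical, not from the patent.

```python
# Hedged sketch of the two cost-value formulas implied by claim 6.
def cost_bandwidth_limited(dma_bytes_total, dma_task_count, convert_overhead,
                           dma_rate, dma_task_overhead):
    # Transfer-dominated: total DMA bytes over the DMA rate, plus a fixed
    # per-task overhead and the layout-conversion time for the input data.
    return (dma_bytes_total / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)

def cost_compute_limited(compute_time, call_count, init_bytes,
                         dma_task_count, convert_overhead,
                         dma_rate, dma_task_overhead):
    # Compute-dominated: per-invocation compute time, plus the initial data
    # transfer and the same DMA and conversion overheads.
    return (compute_time * call_count
            + init_bytes / dma_rate
            + dma_task_count * dma_task_overhead
            + convert_overhead)
```

Which formula applies is decided first by comparing the computation time against the transfer time at the preset size, as the claim's "restricted scenario" step describes.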
  7. The method according to claim 5 or 6, wherein selecting, based on the resource consumption of each first candidate operator under each of the plurality of block strategies, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies comprises:
    selecting, from a plurality of resource consumption records corresponding to the first candidate operators, target resource consumption records satisfying a preset condition, wherein one first candidate operator under one block strategy corresponds to one resource consumption record;
    determining the block strategy corresponding to a target resource consumption record as a candidate block strategy, running, based on the candidate block strategy, the network layer to be processed comprising the second candidate operator corresponding to the target resource consumption record, and determining a test result corresponding to the candidate block strategy and the second candidate operator; and
    determining, based on the test result, the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators.
  8. The method according to any one of claims 5 to 7, wherein, before selecting the one or more target candidate operators corresponding to the network layer to be processed and the target candidate block strategies corresponding to the target candidate operators from the first candidate operators and the plurality of block strategies, the method further comprises:
    performing an alignment operation on the parameter data corresponding to the first candidate operator based on determined minimum granularity information corresponding to the target neural network, to obtain aligned parameter data corresponding to the first candidate operator,
    wherein the minimum granularity information comprises the minimum granularity of the parameter data in each dimension, and the size of the aligned parameter data in each dimension is an integer multiple of the minimum granularity in the corresponding dimension indicated by the minimum granularity information.
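The alignment step of claim 8 amounts to rounding each dimension up to the nearest multiple of that dimension's minimum granularity. A minimal sketch, with illustrative granularity values (the patent does not fix concrete numbers):

```python
# Sketch of claim 8's alignment: pad each dimension of the parameter data up
# to an integer multiple of that dimension's minimum granularity.
def align_shape(shape, min_granularity):
    def round_up(size, gran):
        # Ceiling division, then scale back: smallest multiple of gran >= size.
        return ((size + gran - 1) // gran) * gran
    return tuple(round_up(s, g) for s, g in zip(shape, min_granularity))
```

For example, with per-dimension granularities (16, 32), a (30, 70) tensor would be aligned to (32, 96); a dimension already on a multiple is left unchanged.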
  9. The method according to any one of claims 1 to 8, wherein, in a case where the parameter data comprises input data and constant data, the plurality of block strategies comprise at least one of the following:
    taking all of the input data as initial data and, based on a determined dimension parameter of the constant data, blocking the constant data in a specified dimension to obtain a blocking result, wherein the initial data is the data written into the initial data region allocated to a direct memory access (DMA) task when the computing device runs the target neural network;
    taking all of the constant data as the initial data and, based on a determined dimension parameter of the input data, blocking the input data in a specified dimension to obtain a blocking result;
    taking part of the input data as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in a specified dimension to obtain a blocking result, wherein a target size of the part of the input data is determined according to the minimum granularity of a first dimension of the input data; and
    taking part of the constant data as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in a specified dimension to obtain a blocking result, wherein a target size of the part of the constant data is determined according to the minimum granularity of a first dimension of the constant data.
  10. The method according to claim 9, wherein taking part of the input data as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in the specified dimension to obtain the blocking result comprises:
    determining the target size of the part of the input data based on i times the minimum granularity of the first dimension of the input data; and
    taking the part of the input data of the target size as the initial data and, based on the determined dimension parameter of the constant data, blocking the constant data in the specified dimension to obtain the blocking result,
    wherein i is a positive integer such that, after the target size of the part of the input data is determined, the data volume of the part of the input data, together with the data volume of the constant data block determined based on the minimum granularity of the dimension parameter of the constant data, satisfies the memory requirements of the computing device.
  11. The method according to claim 9, wherein taking part of the constant data as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in the specified dimension to obtain the blocking result comprises:
    determining the target size of the part of the constant data based on j times the minimum granularity of the first dimension of the constant data; and
    taking the part of the constant data of the target size as the initial data and, based on the determined dimension parameter of the input data, blocking the input data in the specified dimension to obtain the blocking result,
    wherein j is a positive integer such that, after the target size of the part of the constant data is determined, the data volume of the part of the constant data, together with the data volume of the input data block determined based on the minimum granularity of the dimension parameter of the input data, satisfies the memory requirements of the computing device.
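Claims 10 and 11 share one pattern: grow the "initial" slice in whole multiples of the first dimension's minimum granularity while the slice plus one block of the other tensor still fits in device memory. The sketch below is one way to pick the largest such multiplier; how the patented method actually searches for i or j is not stated, and all names and the byte-accounting model are assumptions.

```python
# Hedged sketch: largest positive multiplier m such that a slice of
# m * min_gran units (at bytes_per_unit each), plus one block of the other
# tensor, still fits within the device memory limit.
def largest_multiplier(min_gran, bytes_per_unit, other_block_bytes, mem_limit):
    m = 0
    while (m + 1) * min_gran * bytes_per_unit + other_block_bytes <= mem_limit:
        m += 1
    return m  # 0 means even the smallest slice does not fit
```

A return value of 0 would correspond to no feasible i (or j), i.e. this strategy family is not applicable for the layer.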
  12. The method according to any one of claims 9 to 11, wherein, in a case where the specified dimension is one-dimensional and the dimension parameter comprises the first dimension, taking the constant data and the input data respectively as target data and performing one-dimensional blocking on the target data based on the determined first dimension of the target data to obtain the blocking result comprises:
    determining k times the minimum granularity corresponding to the first dimension of the target data as a target block size and, based on the target block size, blocking the target data one-dimensionally along the first dimension to obtain a plurality of target data blocks corresponding to the target data, wherein k is a positive integer;
    in a case where it is determined that the plurality of target data blocks and the initial data satisfy a set blocking condition, taking (k+1) times the minimum granularity corresponding to the first dimension of the target data as an updated target block size and returning to the step of blocking the target data one-dimensionally along the first dimension based on the target block size, until it is determined that the plurality of target data blocks and the initial data do not satisfy the set blocking condition, and determining k times the minimum granularity corresponding to the first dimension of the target data as the blocking result; and
    in a case where the initial data and the plurality of target data blocks generated when k equals 1 do not satisfy the set blocking condition, determining the blocking result to be a one-dimensional blocking failure.
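The iterative search in claim 12 can be sketched as follows: try block sizes k*g for k = 1, 2, ..., keep the last feasible size, and report failure if even k = 1 is infeasible. The `fits` predicate stands in for the memory and DMA checks of claim 14 and is an assumption, as is the function name.

```python
# Sketch of claim 12's one-dimensional block-size search. g is the minimum
# granularity of the first dimension; fits(blocks) represents the set
# blocking condition (memory and DMA-transfer requirements).
def search_block_size_1d(dim_size, g, fits):
    best = None
    k = 1
    while k * g <= dim_size:
        # Split the first dimension into chunks of k*g (last one may be short).
        blocks = [min(k * g, dim_size - off) for off in range(0, dim_size, k * g)]
        if not fits(blocks):
            break          # the previous k, if any, was the last feasible size
        best = k * g
        k += 1
    return best            # None signals a one-dimensional blocking failure
```

Note the direction of the search: the claim grows k until the condition fails, so the result is the largest granularity-aligned block size that still satisfies the device constraints.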
  13. The method according to any one of claims 9 to 12, wherein, in a case where the specified dimension is two-dimensional and the dimension parameter comprises a second dimension, taking the constant data and the input data respectively as target data and performing two-dimensional blocking on the target data based on the determined first dimension and second dimension of the target data to obtain the blocking result comprises:
    determining y times the minimum granularity corresponding to the first dimension of the target data as a first target block size and, based on the first target block size, blocking the target data one-dimensionally along the first dimension to obtain a plurality of intermediate data blocks corresponding to the target data, wherein y is a positive integer;
    determining x times the minimum granularity corresponding to the second dimension of the target data as a second target block size and, based on the second target block size, blocking at least one intermediate data block two-dimensionally along the second dimension to obtain a plurality of target data blocks corresponding to each intermediate data block, wherein x is a positive integer; and
    in a case where it is determined that the plurality of target data blocks and the initial data satisfy the set blocking condition, taking (x+1) times the minimum granularity corresponding to the second dimension of the target data as an updated second target block size and returning to the step of blocking the at least one intermediate data block two-dimensionally along the second dimension based on the second target block size, until it is determined that the plurality of target data blocks and the initial data do not satisfy the set blocking condition, and determining x times the minimum granularity corresponding to the second dimension of the target data as the blocking result.
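The two-dimensional variant fixes the first-dimension block at y*g1 and then repeats the claim-12 search along the second dimension. A hedged sketch under the same assumptions (the `fits` predicate and all names are illustrative):

```python
# Sketch of claim 13's two-dimensional block-size search: the first dimension
# is cut into blocks of y*g1, then the second-dimension block x*g2 is grown
# while the blocking condition still holds.
def search_block_size_2d(shape, g1, g2, y, fits):
    h = y * g1              # fixed first-dimension block size
    best = None
    x = 1
    while x * g2 <= shape[1]:
        if not fits(h, x * g2):
            break           # the previous x, if any, was the last feasible size
        best = (h, x * g2)
        x += 1
    return best             # None: no feasible 2-D block at this y
```

In a full implementation one would presumably also iterate over y, falling back to two-dimensional blocking only when the one-dimensional search of claim 12 fails; the claims leave that outer control flow to the implementation.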
  14. The method according to claim 12 or 13, wherein, in a case where the parameter data corresponding to the network layer to be processed comprises output data, determining that the plurality of target data blocks and the initial data satisfy the set blocking condition comprises:
    determining that the plurality of target data blocks and the initial data satisfy the set blocking condition in a case where it is determined that the initial data, the output data, and at least one target data block each satisfy the memory requirements of the computing device, and that the initial data, the output data, and the at least one target data block each satisfy the DMA transfer requirements of the computing device.
  15. A neural network operation apparatus, comprising:
    a first determination module, configured to determine a network layer to be processed in a target neural network;
    a second determination module, configured to determine, from a plurality of determined operators and a plurality of block strategies, a target operator and a target block strategy corresponding to the network layer to be processed in the target neural network, wherein the plurality of operators are used to implement a function corresponding to the network layer to be processed, and the plurality of block strategies match operation requirements of a computing device used to run the target neural network; and
    a running module, configured to run the target neural network comprising the target operator based on the target block strategy corresponding to the network layer to be processed.
  16. An electronic device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the neural network operation method according to any one of claims 1 to 14.
  17. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when run by a processor, performs the steps of the neural network operation method according to any one of claims 1 to 14.
  18. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the method according to any one of claims 1 to 14.
PCT/CN2021/086229 2020-12-31 2021-04-09 Neural network operation method and apparatus, electronic device, and storage medium WO2022141924A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020227010736A KR20220098341A (en) 2020-12-31 2021-04-09 Neural network operation method, apparatus, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011619783.3 2020-12-31
CN202011619783.3A CN112668701B (en) 2020-12-31 2020-12-31 Neural network operation method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022141924A1 true WO2022141924A1 (en) 2022-07-07

Family

ID=75412062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/086229 WO2022141924A1 (en) 2020-12-31 2021-04-09 Neural network operation method and apparatus, electronic device, and storage medium

Country Status (3)

Country Link
KR (1) KR20220098341A (en)
CN (1) CN112668701B (en)
WO (1) WO2022141924A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023150912A1 (en) * 2022-02-08 2023-08-17 华为技术有限公司 Operator scheduling operation time comparison method and device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130185233A1 (en) * 2012-01-16 2013-07-18 Samsung Electronics Co., Ltd. System and method for learning pose classifier based on distributed learning architecture
CN110348562A (en) * 2019-06-19 2019-10-18 北京迈格威科技有限公司 The quantization strategy of neural network determines method, image-recognizing method and device
CN110796652A (en) * 2019-10-30 2020-02-14 上海联影智能医疗科技有限公司 Image processing method, computer device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599900B (en) * 2015-10-20 2020-04-21 华中科技大学 Method and device for recognizing character strings in image
CN110717905B (en) * 2019-09-30 2022-07-05 上海联影智能医疗科技有限公司 Brain image detection method, computer device, and storage medium
CN111179231B (en) * 2019-12-20 2024-05-28 上海联影智能医疗科技有限公司 Image processing method, device, equipment and storage medium
CN111179372B (en) * 2019-12-31 2024-03-26 上海联影智能医疗科技有限公司 Image attenuation correction method, image attenuation correction device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN112668701A (en) 2021-04-16
KR20220098341A (en) 2022-07-12
CN112668701B (en) 2023-12-22


Legal Events

- ENP — Entry into the national phase: Ref document number 2022519516; Country of ref document: JP; Kind code of ref document: A
- 121 — EP: the EPO has been informed by WIPO that EP was designated in this application: Ref document number 21912689; Country of ref document: EP; Kind code of ref document: A1
- NENP — Non-entry into the national phase: Ref country code: DE
- 32PN — EP: public notification in the EP bulletin as the address of the addressee cannot be established: Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/11/2023)