WO2020164469A1 - Neural network calculation method and apparatus, mobile terminal and storage medium - Google Patents


Info

Publication number
WO2020164469A1
PCT/CN2020/074719 · CN2020074719W
Authority
WO
WIPO (PCT)
Prior art keywords
operators
operator
executed
neural network
operator sets
Application number
PCT/CN2020/074719
Other languages
French (fr)
Chinese (zh)
Inventor
刘耀勇
陈岩
Original Assignee
Oppo广东移动通信有限公司
Application filed by Oppo广东移动通信有限公司
Publication of WO2020164469A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A neural network calculation method and apparatus, a mobile terminal and a storage medium. The method comprises: a mobile terminal acquiring M operators to be executed and calculating the dependencies between the M operators to be executed, wherein M is an integer greater than or equal to two (101); the mobile terminal cutting the M operators to be executed according to the dependencies between them so as to obtain N operator sets, each of the N operator sets comprising at least one operator, and N being an integer greater than or equal to two (102); and, if the N operator sets are mutually independent operator sets, the mobile terminal enabling N threads to respectively calculate the operators in the N operator sets (103). The described method can reduce the inference time of a neural network.

Description

Neural network calculation method, device, mobile terminal and storage medium
Technical field
This application relates to the field of communication technology, and in particular to a neural network calculation method, device, mobile terminal, and storage medium.
Background
In current neural network algorithm frameworks (for example, TensorFlow Lite), when a neural network computation is performed, all operators that need to be executed are added to a single pending queue, and the processor then calls and executes them in turn; that is, the operators are executed sequentially on one thread. As neural networks grow more complex and the number of operators increases, the inference time of the neural network grows accordingly.
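The serial scheduling described above can be sketched in a few lines of Python. The operator callables and their names here are illustrative stand-ins: a real framework such as TensorFlow Lite dispatches compiled kernels, not Python functions.

```python
from collections import deque

# Illustrative stand-ins for framework operators.
def op_a(x): return x + 1
def op_b(x): return x * 2
def op_c(x): return x - 3

def run_serial(operators, value):
    """All operators go into one pending queue and run one after
    another on a single thread -- the behaviour the patent improves on."""
    queue = deque(operators)
    while queue:
        op = queue.popleft()
        value = op(value)
    return value

run_serial([op_a, op_b, op_c], 5)  # ((5 + 1) * 2) - 3 = 9
```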
Summary of the invention
The embodiments of the present application provide a neural network calculation method, device, mobile terminal, and storage medium, which can reduce the inference time of a neural network.
In a first aspect, an embodiment of the present application provides a neural network calculation method based on a neural network algorithm framework, including:
acquiring M operators to be executed, and calculating the dependencies between the M operators to be executed, where M is an integer greater than or equal to 2;
cutting the M operators to be executed according to the dependencies between them to obtain N operator sets, each of the N operator sets including at least one operator, where N is an integer greater than or equal to 2; and
if the N operator sets are mutually independent operator sets, enabling N threads to respectively calculate the operators in the N operator sets.
In a second aspect, an embodiment of the present application provides a neural network computing device. The device includes a communication unit and a processing unit, wherein:
the communication unit is configured to acquire M operators to be executed; and
the processing unit is configured to calculate the dependencies between the M operators to be executed, where M is an integer greater than or equal to 2; to cut the M operators to be executed according to those dependencies to obtain N operator sets, each of the N operator sets including at least one operator, where N is an integer greater than or equal to 2; and, when the N operator sets are mutually independent operator sets, to enable N threads to respectively calculate the operators in the N operator sets.
In a third aspect, an embodiment of the present application provides a mobile terminal including a processor and a memory. The memory is configured to store one or more programs, the one or more programs are configured to be executed by the processor, and the programs include instructions for performing the steps of the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that in the neural network calculation method based on a neural network algorithm framework described in the embodiments of this application, when a neural network computation is performed, M operators to be executed are acquired and the dependencies between them are calculated, where M is an integer greater than or equal to 2; the M operators are cut according to those dependencies to obtain N operator sets, each including at least one operator, where N is an integer greater than or equal to 2; and if the N operator sets are mutually independent, N threads are enabled to respectively calculate the operators in the N operator sets. Because the N independent operator sets are computed by N threads simultaneously, the speed of the neural network computation is increased, and the inference time of the neural network is therefore reduced.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a neural network calculation method based on a neural network algorithm framework disclosed in an embodiment of the present application;
Fig. 2 is a schematic diagram of dependencies between operators disclosed in an embodiment of the present application;
Fig. 3 is a schematic flowchart of another neural network calculation method based on a neural network algorithm framework disclosed in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a neural network computing device disclosed in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a mobile terminal disclosed in an embodiment of the present application.
Detailed description
To enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present application are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The terms "first", "second", and the like in the specification, the claims, and the above drawings of the present invention are used to distinguish different objects rather than to describe a specific order. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present invention. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
The mobile terminals involved in the embodiments of this application may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on. For ease of description, the devices mentioned above are collectively referred to as mobile terminals.
The embodiments of the present application are described in detail below.
Please refer to Fig. 1, which is a schematic flowchart of a neural network calculation method based on a neural network algorithm framework disclosed in an embodiment of the present application. As shown in Fig. 1, the method includes the following steps.
101. The mobile terminal acquires M operators to be executed and calculates the dependencies between the M operators to be executed, where M is an integer greater than or equal to 2.
In the embodiments of this application, the neural network algorithm framework may be TensorFlow or TensorFlow Lite. TensorFlow is a framework for training and running neural network models on a personal computer (PC). TensorFlow Lite is a framework for training and running neural network models on a mobile terminal, which may run iOS or Android.
The neural network algorithm framework may include a controller unit, an arithmetic unit, and a storage unit. The controller unit is used to store and process instructions, the arithmetic unit is used to compute operators, and the storage unit is used to store neurons, weights, and so on. In a neural network model, an operator represents one kind of computation; for example, addition, subtraction, multiplication, and division are four operators. When neural network inference is performed, multiple operators need to be computed; at present all operators are executed serially, which leads to a long inference time.
In the embodiments of this application, when neural network inference is performed, multiple operators need to be computed. After acquiring the M operators to be executed, the controller unit calculates the dependencies between them. The M operators to be executed may be the operators that need to be executed in the entire inference process, the operators that need to be executed in the computation of one layer of the network, or only some of the operators needed in the computation of one layer.
The operators in the embodiments of this application may include the Conv2D operator, the FusedBatchNorm operator, the Relu operator, the DepthwiseConv2dNative operator, the MaxPool operator, the BiasAdd operator, the ConcatV2 operator, and so on.
The Conv2D operator computes a two-dimensional convolution between given four-dimensional input data and a four-dimensional filter tensor, which may also be called a four-dimensional convolution kernel tensor. The Conv2D operator specifies that the four-dimensional input data includes the number of training samples (batch), the input height (inputHeight), the input width (inputWidth), and the number of input channels (inputChannel). The four-dimensional filter tensor includes the filter height (filterHeight), the filter width (filterWidth), the number of filter channels (filterChannel), and the number of filters (filterNumber). The Conv2D operator slides the four-dimensional filter tensor over the four-dimensional input data with certain strides, performing multiply-accumulate operations to obtain the two-dimensional convolution result.
The FusedBatchNorm operator is frequently used in deep neural networks to accelerate training; it speeds up convergence and improves stability, and is currently an indispensable component of deep neural networks.
The Relu operator, also known as the ReLU function, stands for "rectified linear unit"; it takes the maximum of the convolved input x and zero, i.e. max(x, 0). The ReLU operator sets all negative values in the matrix x to zero and leaves the remaining values unchanged; it is applied after the convolution operation.
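As a concrete illustration, a minimal ReLU over a nested-list matrix (plain Python for clarity; frameworks implement this as an optimized kernel):

```python
def relu(matrix):
    """Element-wise max(x, 0): negative entries become zero, others unchanged."""
    return [[max(v, 0.0) for v in row] for row in matrix]

relu([[-1.5, 2.0], [3.0, -4.0]])  # [[0.0, 2.0], [3.0, 0.0]]
```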
The DepthwiseConv2dNative operator likewise computes a two-dimensional convolution between given four-dimensional input data and a four-dimensional filter tensor (also called a four-dimensional convolution kernel tensor). The four-dimensional input data includes the number of training samples (batch), the input height (inputHeight), the input width (inputWidth), and the number of input channels (inputChannel). The four-dimensional filter tensor includes the filter height (filterHeight), the filter width (filterWidth), the number of filter channels (filterChannel), and a channel multiplier (channel_multiplier). The operator slides the filter tensor over the input data with certain strides, performing multiply-accumulate operations to obtain the two-dimensional convolution result.
The MaxPool operator is one kind of pooling operator, an algorithm that discards part of the data in the convolution result.
The BiasAdd operator is a bias operator: it adds a vector called bias to a matrix called value, adding the vector to every row of the matrix, so the result has the same size as the value matrix. The BiasAdd operator performs an addition operation.
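The row-wise addition that BiasAdd performs can be illustrated as follows (plain Python; `bias_add` is an illustrative helper, not the framework's actual API):

```python
def bias_add(value, bias):
    """Add the bias vector to every row of the value matrix; the result
    has the same shape as the value matrix."""
    assert all(len(row) == len(bias) for row in value)
    return [[v + b for v, b in zip(row, bias)] for row in value]

bias_add([[1, 2], [3, 4]], [10, 20])  # [[11, 22], [13, 24]]
```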
The ConcatV2 operator is an operation that concatenates two matrices: it merges the two matrices, and the merged matrix gains rows or columns.
Dependencies may exist between different operators; for example, only after the Conv2D operator has been executed can the activation operator, pooling operator, normalization operator, and so on be executed. The mobile terminal can determine the dependency between operators according to the order in which they must be executed.
For example, please refer to Fig. 2, which is a schematic diagram of dependencies between operators disclosed in an embodiment of the present application. As shown in Fig. 2, suppose there are 8 operators to be executed: the first through eighth operators. The second and fifth operators can be executed only after the first operator; the third operator only after the second; the fourth operator only after the third; the sixth operator only after the fifth; the seventh operator only after the sixth; and the eighth operator only after both the fourth and seventh operators. It can be seen from Fig. 2 that the first, second, third, fourth, and eighth operators form one dependency chain, and the first, fifth, sixth, seventh, and eighth operators form another. The chain of the second, third, and fourth operators and the chain of the fifth, sixth, and seventh operators are mutually independent: there is no strict execution order between the two.
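The dependency structure of this example can be written down as a dependency map. The names `op1`..`op8` are illustrative stand-ins for the first through eighth operators of Fig. 2:

```python
# Each operator lists the operators that must finish before it may run.
depends_on = {
    "op1": [],
    "op2": ["op1"], "op3": ["op2"], "op4": ["op3"],
    "op5": ["op1"], "op6": ["op5"], "op7": ["op6"],
    "op8": ["op4", "op7"],
}

def ready(done):
    """Operators whose prerequisites have all finished."""
    return [op for op, deps in depends_on.items()
            if op not in done and all(d in done for d in deps)]

ready({"op1"})  # ['op2', 'op5'] -- the two independent chains may start
```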
102. The mobile terminal cuts the M operators to be executed according to the dependencies between them to obtain N operator sets, each of the N operator sets including at least one operator, where N is an integer greater than or equal to 2.
In the embodiments of this application, the mobile terminal may cut the M operators to be executed with a certain cutting algorithm according to the dependencies between them, obtaining N operator sets in a way that minimizes the dependence between the sets, so that as many of the N operator sets as possible are mutually independent. Taking Fig. 2 as an example, the 8 operators to be executed can be cut into 4 operator sets: the first set includes the first operator; the second set includes the second, third, and fourth operators; the third set includes the fifth, sixth, and seventh operators; and the fourth set includes the eighth operator. The first set has dependencies with the second and third sets, the fourth set has dependencies with the second and third sets, and the second and third sets are mutually independent.
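The four-set cut of the Fig. 2 example, together with an independence check between sets, can be sketched as follows. The check inspects direct dependencies only, which suffices for this example; the names `op1`..`op8` are illustrative stand-ins:

```python
depends_on = {
    "op1": [], "op2": ["op1"], "op3": ["op2"], "op4": ["op3"],
    "op5": ["op1"], "op6": ["op5"], "op7": ["op6"], "op8": ["op4", "op7"],
}
# The four operator sets described in the text.
sets = [{"op1"}, {"op2", "op3", "op4"}, {"op5", "op6", "op7"}, {"op8"}]

def independent(a, b):
    """Two sets are independent when no operator in one depends
    (directly) on an operator in the other."""
    return (not any(d in b for op in a for d in depends_on[op]) and
            not any(d in a for op in b for d in depends_on[op]))

independent(sets[1], sets[2])  # True  -- the two chains can run in parallel
independent(sets[0], sets[1])  # False -- op2 depends on op1
```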
103. If the N operator sets are mutually independent operator sets, the mobile terminal enables N threads to respectively calculate the operators in the N operator sets.
In the embodiments of this application, if the N operator sets are mutually independent, there is no dependency between them and no set needs to be executed before another. The mobile terminal can therefore enable N threads to calculate the operators in the N operator sets simultaneously, which increases the speed of the neural network computation and thus reduces the inference time of the neural network.
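Running independent operator sets on separate threads can be sketched with Python's `threading` module. The two operator sets and their callables below are illustrative stand-ins for real kernels:

```python
import threading

# Each thread runs the operators of one independent set in order and
# writes its result into its own slot.
def run_set(ops, results, slot):
    value = 0
    for op in ops:
        value = op(value)
    results[slot] = value

op_sets = [
    [lambda v: v + 1, lambda v: v * 3],   # set 1: (0 + 1) * 3 = 3
    [lambda v: v + 10, lambda v: v - 4],  # set 2: (0 + 10) - 4 = 6
]

results = [None] * len(op_sets)
threads = [threading.Thread(target=run_set, args=(ops, results, i))
           for i, ops in enumerate(op_sets)]
for t in threads:
    t.start()
for t in threads:
    t.join()
results  # [3, 6]
```

Note that in CPython the global interpreter lock limits true parallelism of Python-level code; real frameworks achieve parallel speedup by running native kernels on worker threads.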
Optionally, step 102 may include the following step:
the mobile terminal cuts the M operators to be executed with a graph-partitioning algorithm according to the dependencies between them, obtaining N operator sets.
With a graph-partitioning algorithm, the directed graph can be divided accurately so that the dependence between the N operator sets is as small as possible, which increases the number of operator sets that can be executed in parallel and thus the speed of operator computation.
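As a toy illustration of the partitioning idea (not a real graph-partitioning algorithm, which is considerably more involved): removing the shared fork and join operators of the Fig. 2 example leaves two weakly-connected chains that are mutually independent. Node names are illustrative:

```python
from collections import defaultdict

edges = [("op1", "op2"), ("op2", "op3"), ("op3", "op4"),
         ("op1", "op5"), ("op5", "op6"), ("op6", "op7"),
         ("op4", "op8"), ("op7", "op8")]
cut_nodes = {"op1", "op8"}  # the shared fork and join operators

# Build an undirected adjacency map over the remaining nodes.
adj = defaultdict(set)
for u, v in edges:
    if u not in cut_nodes and v not in cut_nodes:
        adj[u].add(v)
        adj[v].add(u)

def components(nodes):
    """Connected components of the remaining graph, via DFS."""
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

nodes = {n for e in edges for n in e} - cut_nodes
components(sorted(nodes))  # [{'op2', 'op3', 'op4'}, {'op5', 'op6', 'op7'}]
```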
Optionally, after step 101, the following step may also be performed:
the mobile terminal obtains a directed graph of the M operators to be executed according to the dependencies between them.
Cutting the M operators to be executed with a graph-partitioning algorithm according to the dependencies between them to obtain N operator sets then specifically includes:
the mobile terminal cuts the directed graph of the M operators to be executed with the graph-partitioning algorithm according to the dependencies between them, obtaining N directed subgraphs, where each directed subgraph corresponds to one operator set.
The dependency diagram shown in Fig. 2 may also be called a directed graph, in which the rectangular boxes represent operators and the connecting lines between them represent dependencies. The rectangular boxes can be abstracted as the vertices of the directed graph and the connecting lines as its edges. The operator at the end of a connecting line (the head of the arrow) can be computed only after the operator at its start (the tail of the arrow) has been computed. A directed graph intuitively reflects the dependencies between operators, which facilitates the subsequent division into operator sets.
Taking Fig. 2 as an example, the mobile terminal cuts the directed graph of the 8 operators to be executed with the graph-partitioning algorithm according to the dependencies between them. Specifically, the first node of the directed graph is cut from the second and fifth nodes, and the eighth node is cut from the fourth and seventh nodes, giving 4 directed subgraphs. The first through eighth nodes of the directed graph correspond to the first through eighth operators respectively. The 4 directed subgraphs are the first, second, third, and fourth directed subgraphs. The first directed subgraph includes only the first node of the directed graph; the second includes the second, third, and fourth nodes together with the connecting lines between the second and third nodes and between the third and fourth nodes; the third includes the fifth, sixth, and seventh nodes together with the connecting lines between the fifth and sixth nodes and between the sixth and seventh nodes; the fourth includes only the eighth node. The first directed subgraph has dependencies with the second and third directed subgraphs, the fourth has dependencies with the second and third, and the second and third directed subgraphs are independent of each other.
In the embodiments of this application, the dependencies of the operators that need to be executed in the inference process of the neural network model are first calculated, and the operators are cut according to those dependencies. When the N operator sets obtained by the cut are mutually independent, N threads are enabled to calculate the operators in the N operator sets simultaneously, which increases the speed of the neural network computation and thus reduces the inference time of the neural network.
Please refer to FIG. 3, which is a schematic flowchart of another neural network calculation method based on a neural network algorithm framework disclosed in an embodiment of the present application. FIG. 3 is obtained by further optimization on the basis of FIG. 1. As shown in FIG. 3, the neural network calculation method based on the neural network algorithm framework includes the following steps.
301: The mobile terminal obtains M operators to be executed and calculates the dependency relationships between the M operators to be executed, where M is an integer greater than or equal to 2.
302: The mobile terminal cuts the M operators to be executed according to the dependency relationships between them to obtain N operator sets, where each of the N operator sets includes at least 1 operator and N is an integer greater than or equal to 2.
303: If the N operator sets are mutually independent operator sets, the mobile terminal starts N threads to calculate the operators in the N operator sets respectively.
For the specific implementation of steps 301 to 303 in the embodiment of the present application, reference may be made to the detailed description of steps 101 to 103 shown in FIG. 1, which will not be repeated here.
304: If the N operator sets are not mutually independent operator sets, the mobile terminal uses a forward-backward alternating iterative scheduling algorithm to determine, according to the dependency relationships between the N operator sets, which operators in the N operator sets need to be executed in parallel and which need to be executed serially.
305: The mobile terminal determines the execution order of the operators to be executed in parallel and the operators to be executed serially, and schedules those operators in the N operator sets for calculation accordingly.
In the embodiment of the present application, the forward-backward alternating iterative scheduling algorithm, also known as the CAP-FB algorithm, is a node scheduling algorithm. The embodiment of the present application provides a node scheduling scheme that shortens the parallel execution time of the operators, which increases their parallel execution speed and hence the speed of the neural network calculation, thereby reducing the inference time of the neural network.
FIG. 2 is used below to illustrate which operators in the N operator sets need to be executed in parallel and which need to be executed serially. In FIG. 2, the first operator set includes the first operator; the second operator set includes the second, third, and fourth operators; the third operator set includes the fifth, sixth, and seventh operators; and the fourth operator set includes the eighth operator. The execution order of the 8 operators is as follows: the first operator is executed first; after the first operator finishes, the second operator and the fifth operator are executed in parallel; after the second operator finishes, the third operator is executed, and after the third operator finishes, the fourth operator is executed; after the fifth operator finishes, the sixth operator is executed, and after the sixth operator finishes, the seventh operator is executed; after both the fourth operator and the seventh operator finish, the eighth operator is executed last. The operator sets that need to be executed serially are the first operator set and the fourth operator set, and the operator sets that need to be executed in parallel are the second operator set and the third operator set.
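The execution order above can be reproduced with a simple level-based scheduler: a set becomes ready once every set it depends on has finished, and all ready sets run in parallel. Note that this is only an illustrative approximation of the scheduling result for FIG. 2, not an implementation of the CAP-FB algorithm itself, whose internal details the application does not give.

```python
from concurrent.futures import ThreadPoolExecutor

def schedule_sets(all_sets, deps):
    """Run operator sets level by level: a set becomes ready once every
    set it depends on has finished; ready sets execute in parallel."""
    done, order = set(), []
    while len(done) < len(all_sets):
        ready = [s for s in all_sets
                 if s not in done and deps.get(s, set()) <= done]
        with ThreadPoolExecutor(max_workers=len(ready)) as pool:
            list(pool.map(lambda s: None, ready))  # placeholder work per set
        done |= set(ready)
        order.append(sorted(ready))
    return order

# FIG. 2: set1 runs first, set2 and set3 run in parallel, set4 runs last.
deps = {"set1": set(), "set2": {"set1"},
        "set3": {"set1"}, "set4": {"set2", "set3"}}
print(schedule_sets(["set1", "set2", "set3", "set4"], deps))
# → [['set1'], ['set2', 'set3'], ['set4']]
```

The resulting levels show the serial sets (first and fourth) bracketing the two parallel sets (second and third), matching the order described for FIG. 2.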
It should be noted that FIG. 2 is a simple directed graph given as an example for ease of understanding. In an actual neural network calculation, the number of operators may be in the tens of thousands or more, and the dependency relationships between operators are far more complex, so a forward-backward alternating iterative scheduling algorithm is needed to schedule the execution order of the operators and achieve the best calculation speed.
Optionally, the mobile terminal schedules the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets for calculation as follows:

The mobile terminal determines a scheduling strategy and, according to the scheduling strategy, schedules the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets for calculation. The scheduling strategy is any one of an energy-consumption-priority strategy, a speed-priority strategy, and a balanced strategy.
The energy-consumption-priority strategy focuses on reducing calculation energy consumption as much as possible. The speed-priority strategy focuses on increasing calculation speed, maximizing it on the basis of the available computing resources. The balanced strategy takes both calculation energy consumption and calculation speed into account, reducing energy consumption as much as possible on the premise that the calculation speed reaches a certain threshold. Different scheduling strategies suit different scenarios. For example, when the battery level of the mobile terminal is below a certain threshold, the energy-consumption-priority strategy may be adopted. When the mobile terminal has no calculation with a higher priority than the neural network calculation, the speed-priority strategy may be adopted. When neither of these two scenarios applies, the balanced strategy may be adopted. The embodiment of the present application can thus adopt different scheduling strategies for different scenarios to meet the neural network calculation requirements of each.
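The scenario-based choice of strategy can be sketched as a simple decision rule. The threshold value, parameter names, and strategy labels below are illustrative assumptions, since the application describes the scenarios only qualitatively.

```python
def choose_strategy(battery_level, has_higher_priority_task,
                    low_battery_threshold=0.2):
    """Pick a scheduling strategy from the current scenario.
    All thresholds and names here are illustrative assumptions."""
    if battery_level < low_battery_threshold:
        return "energy_first"   # minimize calculation energy consumption
    if not has_higher_priority_task:
        return "speed_first"    # maximize calculation speed
    return "balanced"           # trade speed against energy consumption

print(choose_strategy(0.1, False))  # → energy_first
print(choose_strategy(0.8, False))  # → speed_first
print(choose_strategy(0.8, True))   # → balanced
```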
Optionally, before the mobile terminal determines the scheduling strategy, the method may further include the following step:

The mobile terminal obtains the memory resources and processing circuit resources available for the neural network calculation.

The mobile terminal then determines the scheduling strategy as follows:

The mobile terminal determines the scheduling strategy according to the memory resources and processing circuit resources available for the neural network calculation.
In the embodiment of the present application, the mobile terminal may have dedicated computing resources for processing the neural network calculation, or may process it directly on the central processing unit. If the central processing unit is used directly, the memory resources and processing circuit resources the mobile terminal can allocate to the neural network calculation are relatively limited. When the allocated memory resources and processing circuit resources are plentiful, the speed-priority strategy may be adopted; when they are scarce, the energy-consumption-priority strategy or the balanced strategy may be adopted. The embodiment of the present application can thus adjust the scheduling strategy according to the amount of memory and processing circuit resources allocated to the neural network calculation, so as to meet the calculation requirements under different hardware resource conditions.
Optionally, before step 303 is performed, the following step may also be performed:

The mobile terminal estimates the expected execution time of a first operator, where the first operator is an operator in any one of the N operator sets.

Optionally, after step 303 is performed, the following step may also be performed:

The mobile terminal obtains the actual execution time of the first operator and corrects the expected execution time of the first operator.
In the embodiment of the present application, when the neural network model runs for the first time, the execution time of each operator differs; even the same operator takes a different time when it processes a different amount of data. Before the first operator has ever been executed, its expected execution time is a preset value. Each time the first operator is executed, its actual execution time is obtained and the expected execution time is corrected once, so that an accurate expected execution time for the first operator is obtained gradually.
For example, consider processing images with the neural network model. Before the first image frame is calculated, the execution times of all operators are assumed to be the same and serve as a base time. When the next frame is processed, the recorded execution time of each operator is corrected (updated) with its actual execution time. The more image frames are processed, the more accurate the corrected execution time of each operator becomes, so the execution time of the operators can be predicted more accurately, providing accurate data for subsequent scheduling between operators and thereby improving the efficiency of operator scheduling and execution.
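The frame-by-frame correction can be sketched as a running blend of the previous estimate and the newly measured time. The blending factor is an illustrative assumption; the application only states that each measurement refines the estimate.

```python
def update_estimate(estimate, actual, alpha=0.5):
    """Blend the previous estimate with the newly measured execution time.
    The blending factor alpha is an illustrative assumption."""
    return (1 - alpha) * estimate + alpha * actual

# All operators start from the same preset base time; each processed
# frame nudges the estimate toward the operator's real cost.
estimate = 10.0                     # preset base time (ms)
for measured in (4.0, 4.2, 3.8):    # measured per-frame execution times
    estimate = update_estimate(estimate, measured)
print(round(estimate, 2))           # → 4.7
```

After three frames the estimate has moved from the arbitrary base time of 10.0 ms to roughly the operator's true cost of about 4 ms, mirroring how more processed frames yield a more accurate prediction.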
In the embodiment of the present application, the dependency relationships of the operators that need to be executed during the inference process of the neural network model are first calculated, and the operators to be executed are cut according to those dependency relationships. When the N operator sets obtained by the cutting are mutually independent, N threads are started to calculate the operators in the N operator sets simultaneously, which increases the speed of the neural network calculation and thereby reduces the inference time of the neural network. When the N operator sets are not mutually independent, a forward-backward alternating iterative scheduling algorithm determines, according to the dependency relationships between the N operator sets, which operators need to be executed in parallel and which need to be executed serially; the execution order of these operators is determined, and the operators in the N operator sets are scheduled for calculation accordingly. Scheduling the operators with the forward-backward alternating iterative scheduling algorithm shortens their parallel execution time, which increases their parallel execution speed and hence the speed of the neural network calculation, thereby reducing the inference time of the neural network.
The foregoing mainly introduces the solution of the embodiment of the present application from the perspective of the method-side execution process. It can be understood that, in order to implement the above functions, the mobile terminal includes hardware structures and/or software modules corresponding to each function. Those skilled in the art should readily realize that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present invention can be implemented in the form of hardware or a combination of hardware and computer software. Whether a certain function is executed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The embodiment of the present application may divide the mobile terminal into functional units according to the foregoing method examples. For example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments of the present application is illustrative and is merely a division by logical function; other division methods are possible in actual implementation.
Please refer to FIG. 4, which is a schematic structural diagram of a neural network computing device disclosed in an embodiment of the present application. As shown in FIG. 4, the neural network computing device is applied to a neural network algorithm framework that includes a plurality of tensor (Tensor) units. The neural network computing device 400 includes a communication unit 401 and a processing unit 402, wherein:
The communication unit 401 is configured to obtain M operators to be executed.

The processing unit 402 is configured to calculate the dependency relationships between the M operators to be executed, where M is an integer greater than or equal to 2; to cut the M operators to be executed according to the dependency relationships between them to obtain N operator sets, where each of the N operator sets includes at least 1 operator and N is an integer greater than or equal to 2; and, when the N operator sets are mutually independent operator sets, to start N threads to calculate the operators in the N operator sets respectively.
Optionally, the processing unit 402 cuts the M operators to be executed according to the dependency relationships between them to obtain the N operator sets as follows: according to the dependency relationships between the M operators to be executed, a graph partitioning algorithm is used to cut the M operators to be executed to obtain the N operator sets.
Optionally, after calculating the dependency relationships between the M operators to be executed, the processing unit 402 is further configured to obtain a directed graph among the M operators to be executed according to those dependency relationships.

The processing unit 402 uses the graph partitioning algorithm to cut the M operators to be executed according to the dependency relationships between them to obtain the N operator sets as follows: according to the dependency relationships between the M operators to be executed, the graph partitioning algorithm is used to cut the directed graph among the M operators to be executed to obtain N directed subgraphs, where each directed subgraph corresponds to one operator set.
Optionally, the processing unit 402 is further configured to, when the N operator sets are not mutually independent operator sets, use the forward-backward alternating iterative scheduling algorithm to determine, according to the dependency relationships between the N operator sets, which operators in the N operator sets need to be executed in parallel and which need to be executed serially; to determine the execution order of those operators; and to schedule the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets for calculation.
Optionally, the processing unit 402 schedules the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets for calculation as follows: a scheduling strategy is determined, and according to the scheduling strategy the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets are scheduled for calculation; the scheduling strategy is any one of an energy-consumption-priority strategy, a speed-priority strategy, and a balanced strategy.
Optionally, the processing unit 402 is further configured to obtain the memory resources and processing circuit resources available for the neural network calculation before determining the scheduling strategy.

The processing unit 402 determines the scheduling strategy as follows: the scheduling strategy is determined according to the memory resources and processing circuit resources available for the neural network calculation.
Optionally, the processing unit 402 is further configured to estimate the expected execution time of a first operator before the N threads are started to calculate the operators in the N operator sets, where the first operator is an operator in any one of the N operator sets.

The processing unit 402 is further configured to obtain the actual execution time of the first operator after the N threads are started to calculate the operators in the N operator sets, and to correct the expected execution time of the first operator.
The communication unit 401 in FIG. 4 may be a communication interface, and the processing unit 402 may be a processor. The neural network computing device shown in FIG. 4 may further include a storage unit 403, which may be a memory (for example, a non-volatile memory).
By implementing the neural network computing device shown in FIG. 4, the dependency relationships of the operators that need to be executed during the inference process of the neural network model can be calculated, and the operators to be executed can be cut according to those dependency relationships. When the N operator sets obtained by the cutting are mutually independent, N threads are started to calculate the operators in the N operator sets simultaneously, which increases the speed of the neural network calculation and thereby reduces the inference time of the neural network.
Please refer to FIG. 5, which is a schematic structural diagram of a mobile terminal disclosed in an embodiment of the present application. As shown in FIG. 5, the mobile terminal 500 includes a processor 501 and a memory 502, and may further include a bus 503 through which the processor 501 and the memory 502 are connected to each other. The bus 503 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in FIG. 5, but this does not mean that there is only one bus or one type of bus. The mobile terminal 500 may also include an input/output device 504, which may include a display screen, such as a liquid crystal display. The memory 502 is used to store one or more programs containing instructions; the processor 501 is used to call the instructions stored in the memory 502 to execute some or all of the method steps in FIG. 2 to FIG. 3.
By implementing the mobile terminal shown in FIG. 5, the dependency relationships of the operators that need to be executed during the inference process of the neural network model can be calculated, and the operators to be executed can be cut according to those dependency relationships. When the N operator sets obtained by the cutting are mutually independent, N threads are started to calculate the operators in the N operator sets simultaneously, which increases the speed of the neural network calculation and thereby reduces the inference time of the neural network.
An embodiment of the present application also provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any neural network calculation method based on a neural network algorithm framework described in the foregoing method embodiments.

An embodiment of the present application also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute some or all of the steps of any neural network calculation method based on a neural network algorithm framework described in the foregoing method embodiments.
It should be noted that, for the sake of simple description, each of the foregoing method embodiments is expressed as a series of action combinations, but those skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention, certain steps may be performed in another order or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own focus. For parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of the units is only a division by logical function, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present invention essentially, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method described in each embodiment of the present invention. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The embodiments of the present application have been described in detail above, and specific examples are used herein to illustrate the principles and implementation of the present invention. The descriptions of the above embodiments are only intended to help understand the method and core ideas of the present invention. Meanwhile, persons of ordinary skill in the art, based on the idea of the present invention, may make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (20)

  1. A neural network calculation method based on a neural network algorithm framework, characterized in that the method comprises:
    获取M个待执行算子,计算所述M个待执行算子之间的依赖关系,N为大于或等于2的整数;Acquiring M to-be-executed operators, and calculating the dependency relationship between the M to-be-executed operators, where N is an integer greater than or equal to 2;
    依据所述M个待执行算子之间的依赖关系对所述M个待执行算子进行切割,得到N个算子集合,所述N个算子集合中的每个算子集合至少包括1个算子,N为大于或等于2的整数;Cut the M to-be-executed operators according to the dependency relationship between the M to-be-executed operators to obtain N operator sets, each of the N operator sets includes at least 1 Operators, N is an integer greater than or equal to 2;
    若所述N个算子集合为相互独立的算子集合,启用N个线程分别对所述N个算子集合中的算子进行计算。If the N operator sets are mutually independent operator sets, N threads are activated to perform calculations on the operators in the N operator sets respectively.
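Claim 1 can be pictured with a short sketch. This is only an illustrative reading, not the patented implementation: independent operator sets are recovered here as connected components of the (undirected) dependency graph, and each set is computed on its own thread. All operator ids and helper names are hypothetical.

```python
from collections import defaultdict
from threading import Thread

def partition_operators(ops, deps):
    """ops: list of operator ids; deps: list of (a, b) meaning b depends on a.
    Returns connected components -- mutually independent operator sets."""
    adj = defaultdict(set)
    for a, b in deps:
        adj[a].add(b)
        adj[b].add(a)
    seen, sets = set(), []
    for op in ops:
        if op in seen:
            continue
        stack, comp = [op], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.append(cur)
            stack.extend(adj[cur] - seen)
        sets.append(comp)
    return sets

def run_in_threads(op_sets, compute):
    """Start one thread per operator set and wait for all of them."""
    threads = [Thread(target=lambda s=s: [compute(o) for o in s]) for s in op_sets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With two disconnected dependency chains, `partition_operators` yields two sets that `run_in_threads` can compute concurrently.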
  2. The method according to claim 1, characterized in that partitioning the M operators to be executed according to the dependency relationships among them to obtain the N operator sets comprises:
    partitioning the M operators to be executed by using a graph partitioning algorithm according to the dependency relationships among the M operators to be executed, to obtain the N operator sets.
  3. The method according to claim 2, characterized in that after computing the dependency relationships among the M operators to be executed, the method further comprises:
    obtaining a directed graph of the M operators to be executed according to the dependency relationships among them;
    and wherein partitioning the M operators to be executed by using the graph partitioning algorithm according to the dependency relationships among them, to obtain the N operator sets, comprises:
    partitioning the directed graph of the M operators to be executed by using the graph partitioning algorithm according to the dependency relationships among them, to obtain N directed subgraphs, wherein each directed subgraph corresponds to one operator set.
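The directed-graph cut in claim 3 can be sketched as follows. The claim does not fix a particular graph partitioning algorithm, so this sketch uses weakly connected components (via union-find) as one plausible choice; each resulting subgraph's node set is one operator set, and the subgraph keeps the directed edges internal to it.

```python
def directed_subgraphs(nodes, edges):
    """nodes: operator ids; edges: list of (a, b) directed dependency edges.
    Returns (sorted_nodes, internal_edges) per weakly connected component."""
    parent = {n: n for n in nodes}

    def find(x):
        # Path-halving union-find lookup.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)   # union the two endpoints

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    comp_of = {n: find(n) for n in nodes}
    return [(sorted(g), [(a, b) for a, b in edges if comp_of[a] == root])
            for root, g in groups.items()]
```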
  4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
    if the N operator sets are not mutually independent operator sets, determining, according to the dependency relationships among the N operator sets and by using a forward-backward alternating iterative scheduling algorithm, the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets; and
    determining an execution order of the operators to be executed in parallel and the operators to be executed serially, and scheduling the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation.
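The "forward-backward alternating iterative scheduling algorithm" named in claim 4 is not defined on this page, so the sketch below substitutes a simpler stand-in for the same decision: topological leveling of the operator-set DAG. Sets in the same level have no mutual dependency and may execute in parallel; successive levels must execute serially.

```python
from collections import defaultdict

def schedule_levels(sets, set_deps):
    """sets: operator-set ids; set_deps: (a, b) meaning set b depends on set a.
    Returns a list of levels; each level is a batch that can run in parallel,
    and the levels themselves run serially in order."""
    indeg = {s: 0 for s in sets}
    out = defaultdict(list)
    for a, b in set_deps:
        out[a].append(b)
        indeg[b] += 1
    level = [s for s in sets if indeg[s] == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for s in level:
            for t in out[s]:
                indeg[t] -= 1
                if indeg[t] == 0:
                    nxt.append(t)
        level = nxt
    return levels
```

For example, two independent sets feeding a third yields one parallel batch followed by one serial step.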
  5. The method according to claim 4, characterized in that scheduling the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation comprises:
    determining a scheduling strategy, and scheduling the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation according to the scheduling strategy;
    wherein the scheduling strategy comprises any one of an energy-consumption-priority strategy, a speed-priority strategy, and a balanced strategy.
  6. The method according to claim 5, characterized in that before determining the scheduling strategy, the method further comprises:
    acquiring memory resources and processing circuit resources available for neural network computation;
    and wherein determining the scheduling strategy comprises:
    determining the scheduling strategy according to the memory resources and processing circuit resources available for neural network computation.
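Claim 6 leaves the mapping from resources to strategy open; the sketch below is a hypothetical illustration of one such mapping. The thresholds (256 MB, 1024 MB, core counts) and the strategy names are invented for illustration and are not fixed by the patent.

```python
def choose_strategy(free_mem_mb, idle_cores):
    """Pick a scheduling strategy from available memory and processing
    resources (illustrative thresholds only)."""
    if free_mem_mb < 256 or idle_cores <= 1:
        return "energy_priority"   # scarce resources: minimize consumption
    if free_mem_mb > 1024 and idle_cores >= 4:
        return "speed_priority"    # ample resources: maximize throughput
    return "balanced"
```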
  7. The method according to any one of claims 1 to 6, characterized in that before starting the N threads to compute the operators in the N operator sets respectively, the method further comprises:
    estimating an expected execution time of a first operator, the first operator being an operator in any one of the N operator sets;
    and wherein after starting the N threads to compute the operators in the N operator sets respectively, the method further comprises:
    acquiring an actual execution time of the first operator, and correcting the expected execution time of the first operator.
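The estimate-then-correct loop of claim 7 can be written in one line. The claim does not name a correction rule; an exponential moving average is one plausible choice, sketched here under that assumption.

```python
def corrected_estimate(expected_ms, actual_ms, alpha=0.5):
    """Revise an operator's expected execution time toward its measured
    time; alpha in (0, 1] weights the new measurement."""
    return (1 - alpha) * expected_ms + alpha * actual_ms
```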
  8. The method according to claim 1, characterized in that the neural network algorithm framework comprises a controller unit, an arithmetic unit, and a storage unit, wherein the controller unit is configured to store and process instructions, the arithmetic unit is configured to compute the operators, and the storage unit is configured to store neurons and weights.
  9. The method according to claim 1, characterized in that each operator to be executed is any one of a Conv2D operator, a FusedBatchNorm operator, a Relu operator, a DepthwiseConv2dNative operator, a MaxPool operator, a BiasAdd operator, and a ConcatV2 operator.
  10. A neural network computing device, characterized in that the neural network computing device comprises a communication unit and a processing unit, wherein:
    the communication unit is configured to acquire M operators to be executed; and
    the processing unit is configured to: compute the dependency relationships among the M operators to be executed, where M is an integer greater than or equal to 2; partition the M operators to be executed according to the dependency relationships among them, to obtain N operator sets, wherein each of the N operator sets comprises at least one operator and N is an integer greater than or equal to 2; and, if the N operator sets are mutually independent operator sets, start N threads to compute the operators in the N operator sets respectively.
  11. The device according to claim 10, characterized in that the processing unit partitions the M operators to be executed according to the dependency relationships among them to obtain the N operator sets specifically by:
    partitioning the M operators to be executed by using a graph partitioning algorithm according to the dependency relationships among the M operators to be executed, to obtain the N operator sets.
  12. The device according to claim 11, characterized in that after computing the dependency relationships among the M operators to be executed, the processing unit is further configured to obtain a directed graph of the M operators to be executed according to the dependency relationships among them;
    and wherein the processing unit partitions the M operators to be executed by using the graph partitioning algorithm according to the dependency relationships among them, to obtain the N operator sets, specifically by:
    partitioning the directed graph of the M operators to be executed by using the graph partitioning algorithm according to the dependency relationships among them, to obtain N directed subgraphs, wherein each directed subgraph corresponds to one operator set.
  13. The device according to any one of claims 10 to 12, characterized in that the processing unit is further configured to: if the N operator sets are not mutually independent operator sets, determine, according to the dependency relationships among the N operator sets and by using a forward-backward alternating iterative scheduling algorithm, the operators that need to be executed in parallel and the operators that need to be executed serially in the N operator sets; determine an execution order of the operators to be executed in parallel and the operators to be executed serially; and schedule the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation.
  14. The device according to claim 13, characterized in that the processing unit schedules the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation specifically by: determining a scheduling strategy, and scheduling the operators to be executed in parallel and the operators to be executed serially in the N operator sets for computation according to the scheduling strategy;
    wherein the scheduling strategy comprises any one of an energy-consumption-priority strategy, a speed-priority strategy, and a balanced strategy.
  15. The device according to claim 14, characterized in that before determining the scheduling strategy, the processing unit is further configured to acquire memory resources and processing circuit resources available for neural network computation;
    and wherein the processing unit determines the scheduling strategy specifically by:
    determining the scheduling strategy according to the memory resources and processing circuit resources available for neural network computation.
  16. The device according to any one of claims 10 to 15, characterized in that the processing unit is further configured to estimate, before the N threads are started to compute the operators in the N operator sets respectively, an expected execution time of a first operator, the first operator being an operator in any one of the N operator sets;
    and wherein the processing unit is further configured to acquire, after the N threads are started to compute the operators in the N operator sets respectively, an actual execution time of the first operator and correct the expected execution time of the first operator.
  17. The device according to claim 10, characterized in that the neural network algorithm framework comprises a controller unit, an arithmetic unit, and a storage unit, wherein the controller unit is configured to store and process instructions, the arithmetic unit is configured to compute the operators, and the storage unit is configured to store neurons and weights.
  18. The device according to claim 10, characterized in that each operator to be executed is any one of a Conv2D operator, a FusedBatchNorm operator, a Relu operator, a DepthwiseConv2dNative operator, a MaxPool operator, a BiasAdd operator, and a ConcatV2 operator.
  19. A mobile terminal, characterized in that it comprises a processor and a memory, wherein the memory is configured to store one or more programs, the one or more programs are configured to be executed by the processor, and the programs include instructions for performing the method according to any one of claims 1 to 9.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1 to 9.
PCT/CN2020/074719 2019-02-12 2020-02-11 Neural network calculation method and apparatus, mobile terminal and storage medium WO2020164469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910111499.6 2019-02-12
CN201910111499.6A CN109902819B (en) 2019-02-12 2019-02-12 Neural network computing method, device, mobile terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2020164469A1 true WO2020164469A1 (en) 2020-08-20

Family

ID=66944748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/074719 WO2020164469A1 (en) 2019-02-12 2020-02-11 Neural network calculation method and apparatus, mobile terminal and storage medium

Country Status (2)

Country Link
CN (1) CN109902819B (en)
WO (1) WO2020164469A1 (en)


Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902819B (en) * 2019-02-12 2023-04-18 Oppo广东移动通信有限公司 Neural network computing method, device, mobile terminal and storage medium
CN110298437B (en) * 2019-06-28 2021-06-01 Oppo广东移动通信有限公司 Neural network segmentation calculation method and device, storage medium and mobile terminal
CN110378413A (en) * 2019-07-17 2019-10-25 Oppo广东移动通信有限公司 Neural network model processing method, device and electronic equipment
CN110503180B (en) * 2019-08-14 2021-09-14 Oppo广东移动通信有限公司 Model processing method and device and electronic equipment
CN110503199A (en) * 2019-08-14 2019-11-26 北京中科寒武纪科技有限公司 Method for splitting and device, the electronic equipment and storage medium of operation node
CN110674936A (en) * 2019-09-24 2020-01-10 上海寒武纪信息科技有限公司 Neural network processing method and device, computer equipment and storage medium
CN111062467B (en) * 2019-12-18 2023-05-12 开放智能机器(上海)有限公司 Automatic neural network subgraph segmentation method applied to AI heterogeneous compiler
CN111210005B (en) * 2019-12-31 2023-07-18 Oppo广东移动通信有限公司 Equipment operation method and device, storage medium and electronic equipment
CN111611479B (en) * 2020-05-07 2024-02-13 北京达佳互联信息技术有限公司 Data processing method and related device for network resource recommendation
CN111984400B (en) * 2020-07-17 2024-04-02 深圳云天励飞技术有限公司 Memory allocation method and device for neural network
WO2022261928A1 (en) * 2021-06-18 2022-12-22 华为技术有限公司 Operation acceleration method and operation accelerator
CN113657584B (en) * 2021-08-31 2024-04-09 安谋科技(中国)有限公司 Neural network model calculation method, data processing method, electronic device and medium
CN114429211A (en) * 2022-02-07 2022-05-03 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for generating information
CN114924745A (en) * 2022-05-19 2022-08-19 北京百度网讯科技有限公司 Operation method and device of deep learning compiler and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103677751A (en) * 2012-09-06 2014-03-26 阿里巴巴集团控股有限公司 Task parallel processing method and device
WO2016028425A1 (en) * 2014-08-21 2016-02-25 Qualcomm Incorporated Programmatic decoupling of task execution from task finish in parallel programs
US20160335119A1 (en) * 2015-05-12 2016-11-17 minds.ai inc Batch-based neural network system
CN108292241A (en) * 2015-10-28 2018-07-17 谷歌有限责任公司 Processing calculates figure
CN109902819A (en) * 2019-02-12 2019-06-18 Oppo广东移动通信有限公司 Neural computing method, apparatus, mobile terminal and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8214831B2 (en) * 2009-05-05 2012-07-03 International Business Machines Corporation Runtime dependence-aware scheduling using assist thread
US9959498B1 (en) * 2016-10-27 2018-05-01 Google Llc Neural network instruction set architecture
CN107729989B (en) * 2017-07-20 2020-12-29 安徽寒武纪信息科技有限公司 Device and method for executing artificial neural network forward operation
CN107748696B (en) * 2017-09-20 2020-05-01 深圳壹账通智能科技有限公司 Task scheduling method and terminal equipment


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112631781A (en) * 2020-12-29 2021-04-09 上海商汤智能科技有限公司 Operator execution method and device, electronic equipment and storage medium
CN116523052A (en) * 2023-07-05 2023-08-01 成都阿加犀智能科技有限公司 Rapid reasoning method, device and equipment
CN116523052B (en) * 2023-07-05 2023-08-29 成都阿加犀智能科技有限公司 Rapid reasoning method, device and equipment

Also Published As

Publication number Publication date
CN109902819A (en) 2019-06-18
CN109902819B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
WO2020164469A1 (en) Neural network calculation method and apparatus, mobile terminal and storage medium
US11151442B2 (en) Convolutional neural network processing method and device
CN112948079B (en) Task scheduling method, device, equipment and computer storage medium
WO2015066979A1 (en) Machine learning method for mapreduce task resource configuration parameters
CN108304925B (en) Pooling computing device and method
WO2023093375A1 (en) Computing resource acquisition method and apparatus, electronic device, and storage medium
CN106293947B (en) GPU-CPU (graphics processing Unit-Central processing Unit) mixed resource allocation system and method in virtualized cloud environment
CN110633959A (en) Method, device, equipment and medium for creating approval task based on graph structure
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
EP3983950A1 (en) Neural network training in a distributed system
CN106778550B (en) Face detection method and device
CN109871270B (en) Scheduling scheme generation method and device
CN111984414B (en) Data processing method, system, equipment and readable storage medium
CN110874635A (en) Deep neural network model compression method and device
CN106250346B (en) A kind of realization method and system of intelligent computer
EP3979505A1 (en) Method and device for determining number of decoder iterations, and storage medium and electronic device
CN114021733A (en) Model training optimization method and device, computer equipment and storage medium
CN109739649B (en) Resource management method, device, equipment and computer readable storage medium
CN114091807A (en) Method, device and system for distributing and scheduling tasks of multiple unmanned aerial vehicles and storage medium
CN114297067A (en) Script testing method and device
US11531578B1 (en) Profiling and debugging for remote neural network execution
EP4024286A1 (en) Computing method and apparatus for convolutional neural network model
CN112149826A (en) Profile graph-based optimization method in deep neural network inference calculation
CN112308217A (en) Convolutional neural network acceleration method and system
CN109308327A (en) Figure calculation method device medium apparatus based on the compatible dot center's model of subgraph model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20756028

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20756028

Country of ref document: EP

Kind code of ref document: A1