CN110659069B - Instruction scheduling method for performing neural network computation and corresponding computing system - Google Patents


Info

Publication number: CN110659069B
Application number: CN201810690479.4A
Authority: CN (China)
Prior art keywords: instruction, data, neural network, current, functional module
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110659069A
Inventors: 于谦, 隋凌志, 方绍峡, 王俊斌, 单羿
Original Assignee: Xilinx Inc
Current Assignee: Xilinx Inc
Application filed by Xilinx Inc
Priority applications: CN201810690479.4A (CN110659069B); US16/454,103 (US11093225B2)
Publication of application: CN110659069A
Publication of grant: CN110659069B

Classifications

    • G06F 9/3889 — Concurrent instruction execution, e.g. pipeline, look ahead, using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F 9/3814 — Instruction prefetching; implementation provisions of instruction buffers, e.g. prefetch buffer, banks
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 — Neural networks; learning methods
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An instruction scheduling method for performing neural network computation and a corresponding neural network computing system are provided. The method comprises the following steps: obtaining instructions for performing neural network computation; executing a current instruction using a first functional module; and, based at least on parameter information of the current instruction and dependency information of a subsequent instruction that directly depends on the current instruction, starting execution of the subsequent instruction using a second functional module before the current instruction finishes executing. By starting execution of dependent subsequent instructions before the current instruction has completed, higher parallelism can be achieved in the computing system between instructions that have dependency relationships, improving the overall efficiency of the neural network computation.

Description

Instruction scheduling method for performing neural network computation and corresponding computing system
Technical Field
The invention relates to the field of deep learning, in particular to an instruction scheduling method for executing neural network computation and a corresponding computing system.
Background
Artificial intelligence has developed rapidly in recent years, has achieved good results in fields such as image classification, detection, and video and speech processing, and still has great prospects for further development. Neural networks are at the core of artificial intelligence applications, and deep learning neural network algorithms are among the most common neural network models. To deploy a trained deep neural network, the neural network algorithm must be compiled into instruction code that can be executed by a neural network computing system.
A defining characteristic of neural network algorithms is that they are data-driven: the compiled instruction code involves a large amount of data but relatively few control-logic instructions. Executing such data-heavy input instructions with higher efficiency and better adaptability is therefore a major challenge.
Therefore, there is a need for an improved instruction scheduling scheme for neural network computations.
Disclosure of Invention
In order to solve at least one of the above problems, the present invention provides an instruction scheduling method for performing neural network computation and a corresponding computing system, which achieve higher parallelism between instructions that have dependency relationships by starting execution of a dependent subsequent instruction before the current instruction has finished executing, thereby improving the overall computational efficiency of the neural network computation.
According to one aspect of the present invention, an instruction scheduling method for performing neural network computation is provided, comprising: obtaining instructions for performing neural network computation; executing a current instruction using a first functional module; and, based at least on parameter information of the current instruction and dependency information of a subsequent instruction that directly depends on the current instruction, starting execution of the subsequent instruction using a second functional module before the current instruction finishes executing.
In this way, the start of execution of the subsequent instruction is moved earlier, making overall instruction execution more compact and improving the overall computational efficiency of the system.
In particular, execution of the current instruction may be divided into a depended-upon phase and an independent phase based at least on type information of the current instruction and the subsequent instruction; the end marker of the current instruction may be generated directly once the depended-upon phase has completed; and the subsequent instruction may then be executed using the second functional module based at least on the current-instruction end marker. Issuing the end marker early in this way enables highly parallel instruction execution among the functional modules of the computing system.
In particular, execution of the current instruction may be divided into a plurality of stages based at least on the parameter information and the dependency information; a stage end marker may be generated when at least one of the plurality of stages has completed; and the subsequent instruction may be executed using the second functional module based at least on the stage end marker. In this way, finer-grained dependent execution between dependent instructions can be realized in the computing system through finer-grained division of the instructions. The granularity of the stage division may be determined based at least on the granularity of the instructions for performing neural network computation and the parameters of the computing system that performs the neural network computation. Preferably, the subsequent instruction may be executed using the second functional module based on data produced by the at least one completed stage.
The acquired instructions for performing neural network computation may include: a data load instruction for loading data for neural network computation from an external memory into an internal cache, the data comprising parameter data and feature map data; a data operation instruction for reading the parameter data and feature map data from the internal cache, performing the operation, and storing the operation result back to the internal cache; and a data store instruction for storing the operation result from the internal cache back to the external memory. The instruction scheduling method is particularly suited to neural network inference computation dominated by these instruction types.
In particular, a current data load instruction may be executed using a data load engine; and, before the indication that the current data load instruction has finished executing is received, execution of the data operation instruction may be started using a data operation engine in response to the weights and feature map data for at least one complete operation unit having been loaded.
In particular, a current data operation instruction may be executed using a data operation engine; and, before the indication that the current data operation instruction has finished executing is received, in response to at least one final operation result being produced and cached in the internal cache, execution of the data store instruction may be started using a data storage engine to store that result from the internal cache back to the external memory.
In particular, a current data store instruction may be executed using a data storage engine; and, when the output feature map data being stored back to the external memory by the current data store instruction has no dependency on the input feature map data to be loaded from the external memory by a data load instruction that directly depends on the current data store instruction, execution of the data load instruction may be started using the data load module as soon as the output feature map data has been written to a bus buffer.
Explicit dependency information describing an instruction's dependencies on other instructions may be included in the acquired instructions, and the explicit dependency information in the current instruction is used as the dependency information of the subsequent instructions that directly depend on the current instruction.
According to another aspect of the present invention, a neural network computing system is provided, comprising: a plurality of functional modules that perform respective functions based on instructions for performing neural network computation; an internal cache for caching data required for performing neural network computation; and a controller configured to: execute a current instruction using a first functional module; and, based at least on parameter information of the current instruction and dependency information of a subsequent instruction that directly depends on the current instruction, start execution of the subsequent instruction using a second functional module before the current instruction finishes executing.
Preferably, the controller may be further configured to: divide execution of the current instruction into a depended-upon phase and an independent phase based at least on type information of the current instruction and the subsequent instruction; generate the current-instruction end marker directly once the depended-upon phase has completed; and execute the subsequent instruction using the second functional module based at least on the current-instruction end marker.
Preferably, the controller may be further configured to: divide execution of the current instruction into a plurality of stages based at least on the parameter information and the dependency information; generate a stage end marker when at least one of the plurality of stages has completed; and execute the subsequent instruction using the second functional module based at least on the stage end marker.
Preferably, the controller may be further configured to: execute the subsequent instruction using the second functional module based on the data produced by the at least one completed stage.
The granularity of the stage division is determined by the controller based at least on the granularity of the instructions for performing neural network computation and the parameters of the computing system that performs the neural network computation.
Preferably, the plurality of functional modules may include: a data load engine that executes data load instructions to load data for neural network computation from an external memory into an internal cache, the data comprising parameter data and feature map data; a data operation engine that executes data operation instructions to read the parameter data and feature map data from the internal cache, perform the operation, and store the operation result back to the internal cache; and a data storage engine that executes data store instructions to store the operation result from the internal cache back to the external memory.
Preferably, the first functional module may be a data load engine and the second functional module a data operation engine, with the data operation engine starting execution of the data operation instruction in response to the data load engine having loaded the weights and feature map data for at least one complete operation unit.
Preferably, the first functional module may be a data operation engine and the second functional module a data storage engine; in response to the data operation engine producing at least one final operation result and caching it in the internal cache, the data storage engine starts execution of the data store instruction to store that result from the internal cache back to the external memory.
Preferably, the first functional module may be a data storage engine and the second functional module a data load engine; when the output feature map data being stored back to the external memory by the data storage engine has no dependency on the input feature map data to be loaded from the external memory by a data load instruction that directly depends on the current data store instruction, execution of the data load instruction is started using the data load module as soon as the output feature map data has been written to a bus buffer.
Explicit dependency information describing an instruction's dependencies on other instructions may be included in the instructions for performing neural network computation, and the controller uses the explicit dependency information in the current instruction as the dependency information of the subsequent instructions that directly depend on the current instruction.
The computing system may be implemented at least in part by a GPU, FPGA, or ASIC.
The present invention provides an instruction scheduling method in a neural network computing system and a corresponding computing system, which achieve higher parallelism between instructions that have dependency relationships by starting execution of dependent subsequent instructions before the current instruction has finished executing, thereby improving the overall efficiency of the neural network computation. Specifically, early execution of subsequent instructions by other modules can be achieved by issuing the instruction end marker in advance, and fine-grained dependence between instructions can also be realized through finer-grained multi-stage division, thereby improving parallelism among the functional modules.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows an example of the configuration of a typical CNN.
Fig. 2 shows a typical operation example of one convolutional layer in a neural network.
Fig. 3 shows an example of the convolution operation.
Fig. 4 shows a compilation diagram of an existing neural network compiler.
FIG. 5 shows a flow diagram of an instruction scheduling method for performing neural network computations, according to one embodiment of the present invention.
FIG. 6 illustrates a schematic diagram of a computing system for performing neural network computations, in accordance with one embodiment of the present invention.
FIG. 7 illustrates a schematic diagram of a computing system for performing neural network computations, according to another embodiment of the present invention.
Fig. 8A and 8B show examples of instruction execution statuses with dependencies in the prior art and the present invention.
Figure 9 shows an example of a SoC that may be used to implement the present invention involving neural network computations.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Artificial intelligence has developed rapidly in recent years, has achieved good results in fields such as image classification, detection, and video and speech processing, and still has great prospects for further development. Neural networks are at the core of artificial intelligence applications, and deep learning neural network algorithms are among the most common neural network models. The workload of a neural network is both compute-intensive and data-intensive. The multiply-accumulate operations required for neural network computation are typically on the order of billions (G); for example, the object-detection neural network SSD requires about 120 G operations. The parameters required for computation are typically on the order of megabytes to hundreds of megabytes; for example, the parameters of the classification neural network VGG amount to about 480 MB.
Common Artificial Neural Networks (ANN) include Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN). The following description uses CNN as an example to provide some background.
CNN basic concept
As shown in fig. 1, a typical CNN consists of a series of layers (layers) that run in order.
A CNN is composed of an input layer, an output layer, and a number of hidden layers connected in series. The first layer of the CNN reads an input value, such as an input image, and outputs a series of activation values (also referred to as a feature map). Each subsequent layer reads the activation values generated by the previous layer and outputs new activation values. A final classifier outputs the probability of each class to which the input image may belong.
These layers can be roughly divided into weighted layers (e.g., CONV layers, fully connected layers, batch normalization layers, etc.) and unweighted layers (e.g., pooling layers, ReLU layers, Softmax layers, etc.). A CONV layer (convolutional layer) takes a series of feature maps as input and convolves them with convolution kernels to obtain output activation values. A pooling layer is typically connected to the CONV layer and outputs the maximum or average value of each sub-area in each feature map, thereby reducing the amount of computation by sub-sampling while maintaining some degree of invariance to displacement, scale, and deformation. A CNN may contain multiple alternations between convolutional and pooling layers, gradually reducing the spatial resolution and increasing the number of feature maps. CONV layers can also be connected directly, without an intervening pooling layer. The network can then be connected to at least one fully connected layer (FC), which applies a linear transformation to the input feature vector and produces a one-dimensional vector output comprising a plurality of feature values.
In general, the operation of a weighted layer can be represented as:
Y=WX+b,
where W is the weight value, b is the bias, X is the input activation value, and Y is the output activation value.
The operation of the unweighted layer can be represented as:
Y=f(X),
wherein f (X) is a non-linear function.
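As a concrete illustration of the two formulas above, the following Python/NumPy sketch evaluates a weighted (fully connected) layer and an unweighted (ReLU) layer; the tensor shapes are hypothetical and chosen only for the example.

```python
import numpy as np

# Weighted layer: Y = W X + b (a fully connected layer with 4 inputs, 3 outputs)
X = np.random.rand(4)        # input activation values
W = np.random.rand(3, 4)     # learned weights
b = np.random.rand(3)        # learned bias
Y = W @ X + b                # output activation values

# Unweighted layer: Y = f(X), here with f = ReLU
relu = lambda v: np.maximum(v, 0.0)
Y_relu = relu(Y)

print(Y.shape, Y_relu.shape)  # (3,) (3,)
```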
Here, "weights" (weights) refer to parameters in the hidden layer. In a CNN network, the weights can be considered as convolution kernels that can vary in size for each convolutional layer, and also in value for each channel of each convolutional layer. It is to be understood in a broad sense that the weights may also include biases and are values learned through the training process and remain unchanged at the time of inference. In addition, the CNN may also include parameters for performing other operations, such as parameters required for various types of operations by the layer without weighting. The activation value refers to a value, also referred to as a feature value, transferred between layers, starting from an input layer, and an output of each layer is obtained by an operation of the input value and a weight value. Unlike the parameter values, the distribution of activation values may vary dynamically depending on the input data sample.
As shown, each layer from the input feature map (input image) has a plurality of channels (channels) to characterize different features of the input image before the feature values are fed into the FC layer. When the color image is input, the initial input feature map usually has three channels of RGB, the feature values and convolution kernels with the same size but different values in different channels in the same Layer are respectively subjected to convolution calculation to generate the output feature value of the Layer, and then the feature value is sent to the next CONV Layer (Layer 1) with the number of channels and the size of the convolution kernels being different for further feature extraction. The above process is repeated until the output of Layer 7 is fed to the FC Layer. As shown, W, H and C in the input feature map refer to the three dimensions width, height, and channel, respectively. The above arrows may refer to a specific order of computation or degree of computational parallelism (especially in the case of computations on high-parallelism computing platforms).
The first FC layer may be a fully-connected layer for extracting features of individual channels as one-dimensional feature vector. The second FC layer may then be a classifier for classification.
Operation of the convolutional layer
Whether DNN, RNN, or CNN, a typical neural network model, especially for computer vision applications, includes multiple CONV layers as shown in fig. 1. Each CONV layer extracts higher-level abstract data from the input feature map data in order to preserve the important and unique information in the input data. Modern DNNs are able to achieve excellent visual performance by using very deep networks (e.g., hundreds of convolutional layers).
Fig. 2 shows a typical operation example of one convolutional layer in a neural network. The same applies to a fully connected layer, such as the FC layer shown in fig. 1. The three-dimensional input to each convolutional layer is a two-dimensional feature map (W × H) with multiple channels (C). The first input to a neural network that performs visual processing is typically a two-dimensional image with three color channels (RGB). A number of three-dimensional filters (M filters of dimensions R × S × C, also referred to as convolution kernels) are then convolved with the input feature map, and each filter generates one channel of the output three-dimensional feature map (a two-dimensional E × F feature map with M channels). The same set of M filters can be applied to a batch (B) of N input feature maps, so N input feature maps yield N output feature maps (the batch B can be regarded here as a fourth dimension of the input). In addition, a one-dimensional bias (not shown in fig. 2) may be added to the filtered results.
Fig. 3 shows an example of the convolution operation. This operation can be regarded as the convolution of a two-dimensional filter (R × S) with a two-dimensional feature map (W × H) on one channel C. As shown in fig. 3, a 5 × 5 (W × H) feature map is convolved with a 3 × 3 (R × S) convolution kernel at stride 1. The left side of the figure shows the first convolution calculation, the middle the second, and so on. By the definition of convolution, each individual convolution calculation can be decomposed into multiple multiply-accumulate operations. After 9 convolution calculations, the convolved 3 × 3 feature map on the right side of fig. 3 is obtained. There is no dependency among these 9 convolution calculations, so on a highly parallel computing platform they can be completed in a single operation (the parallelism M can typically reach the order of thousands). Fig. 3 can be regarded as the convolution on one channel C of the CONV layer's multiple channels; only after the convolutions over all channels C and the subsequent accumulation are completed is one channel of the M-channel output three-dimensional feature map obtained. Further, that output three-dimensional feature map (a two-dimensional E × F feature map with M channels) is only one of the N output three-dimensional feature maps in the batch.
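A minimal NumPy sketch of the single-channel convolution just described (5 × 5 input, 3 × 3 kernel, stride 1, producing a 3 × 3 output); the values are hypothetical, and the loop form is chosen to make the nine independent multiply-accumulate groups explicit.

```python
import numpy as np

fmap = np.arange(25, dtype=np.float32).reshape(5, 5)   # W x H = 5 x 5 feature map
kernel = np.ones((3, 3), dtype=np.float32)             # R x S = 3 x 3 convolution kernel
out = np.zeros((3, 3), dtype=np.float32)               # E x F = 3 x 3 output

# Nine independent convolution calculations (stride 1); each is a sum of
# 3 x 3 = 9 multiply-accumulate operations and none depends on another, so a
# highly parallel platform can issue them all in a single operation.
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(fmap[i:i + 3, j:j + 3] * kernel)

print(out)
```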
Deployment of neural networks
Before a CNN is deployed for inference (e.g., image classification), it must first be trained. The parameters of the various layers of the neural network model, such as the weights and biases, are determined by training on large amounts of training data.
To deploy a trained deep neural network, a compiler is required to compile the neural network algorithm into a binary instruction stream that can be executed by a computing platform. Unlike applications developed in high-level languages such as C++ or Java, neural network algorithms have their own distinctive syntax and structure. For this reason, high-performance computing platforms dedicated to neural network computation and corresponding neural network compilers have emerged. For example, the deep neural network compiler DNNC (Deep Neural Network Compiler) can compile neural network algorithms into an optimized instruction stream for a DPU (Deep Learning Processor Unit) platform. By analyzing the topology of the neural network, the compiler builds an internal computation-graph intermediate representation (IR) together with its control-flow and data-flow information, and applies various compilation optimization and transformation techniques based on this IR, effectively reducing the memory-access bandwidth and power-consumption requirements of the system while improving the DPU's computing performance. Fig. 4 shows a schematic of compilation by an existing neural network compiler. As shown in fig. 4, a specialized neural network algorithm (e.g., a pruned CNN) may be fed into a neural network compiler, which includes a compilation front end, an optimizer, and an instruction generator, and which generates binary instruction code for a neural network computing platform (e.g., a DPU).
Herein, "compilation" refers to the process of using a compiler to generate, from a representation described by a high-level formal method, low-level object code that executes on a computing platform. Since a hardware computing platform processes only binary instruction code, a compiler is required to convert a human-readable high-level description into computer-readable low-level binary code. Unlike source code described in high-level programming languages such as C/C++, a neural network is represented by a specialized model that describes the neural network algorithm. The neural network algorithm comprises the topology of the neural network and its parameters. The formal description of the neural network topology requires far less storage than the massive set of neural network parameters.
Herein, a neural network computing system refers to a hardware platform dedicated to performing neural network inference computations, which may also be referred to as a neural network computing platform, and may be implemented as a neural network-dedicated processor, such as the DPU described above.
Instruction scheduling techniques of the present invention
According to the organization of instruction and data streams, computer architectures can be divided into four basic types: single instruction single data (SISD), single instruction multiple data (SIMD), multiple instruction single data (MISD), and multiple instruction multiple data (MIMD). SISD is the traditional architecture: the hardware does not support any form of parallel computation and all instructions are executed serially; most early computers adopted this architecture. SIMD architectures are commonly used in digital signal processing, image processing, and multimedia processing, where one instruction drives many parallel data processing units. MISD architectures are rarely practical; since a computing system that adopts multiple instruction streams usually also operates on multiple parallel data streams, the MIMD architecture is far more widely used.
In neural network inference applications, a multiple-instruction multiple-data architecture is usually adopted because of the high computational parallelism (as described above with reference to figs. 2 and 3) and the need for multiple interdependent acceleration engines to work together. The scheduling among the various instruction streams in such an architecture often determines the efficiency of the multiple data streams. The present invention provides an efficient instruction scheduling scheme that can effectively improve the instruction execution efficiency of the system.
FIG. 5 shows a flow diagram of an instruction scheduling method for performing neural network computations, according to one embodiment of the present invention. It should be understood that the instruction scheduling method described above may be implemented by a computing system for performing neural network computations, such as by the deep learning special purpose processor (DPU) described above or other hardware platform for performing neural network reasoning.
In step S510, an instruction for performing neural network computation is acquired. The instructions fetched here may be compiled binary instruction codes executable by the DPU as shown in fig. 4.
In step S520, the current instruction is executed using the first functional module. Subsequently, in step S530, based on at least the parameter information of the current instruction and the dependency information of the subsequent instruction directly dependent on the current instruction, before the current instruction finishes executing, the execution of the subsequent instruction is started using a second functional module.
In a multiple-instruction multiple-data system such as a neural network computing system, there are typically two or more functional modules (e.g., acceleration engines), each executing the instructions that correspond to it. The functional modules can execute their respective instructions in parallel, while the instructions of different functional modules have certain dependency relationships. The method of the present invention exploits finer-grained parallelism between dependent instructions, so that a subsequent instruction can start executing once only a part of the current instruction has completed; this increases the overlap of instruction execution and improves the overall efficiency of neural network inference computation.
In one embodiment, step S530 further comprises: dividing execution of the current instruction into a depended-upon phase and an independent phase based at least on type information of the current instruction and the subsequent instruction; generating the current-instruction end marker directly once the depended-upon phase has completed; and executing the subsequent instruction using the second functional module based at least on the current-instruction end marker. In this case, the instruction end marker can be issued in advance, as soon as the phase of the current instruction that is actually depended upon has finished executing, so that execution of the subsequent instruction can begin.
In another embodiment, step S530 further comprises: dividing execution of the current instruction into a plurality of stages based at least on the parameter information and the dependency information; generating a stage end marker when at least one of the plurality of stages has completed; and executing the subsequent instruction using the second functional module based at least on the stage end marker. In other words, during actual neural network inference, the processor can further divide the acquired instructions into finer granularities based on the instruction parameters and on the execution order and dependency relationships among the instructions, and use internally generated end markers for these finer-grained completions to let dependent subsequent instructions start executing earlier, thereby improving the overall efficiency of the neural network computation.
Dependencies between instructions typically concern either the results of a previous instruction's execution or the hardware needed for that execution. When the dependency is on the execution result, executing the subsequent instruction using the second functional module based at least on the indication that at least one of the plurality of stages has completed may include: executing the subsequent instruction using the second functional module based on the data produced by the at least one completed stage.
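The following Python sketch illustrates, under stated assumptions, the kind of event-driven scheduling described above: a current instruction is divided into stages, a stage end marker is raised as each stage completes, and the module executing the dependent subsequent instruction starts as soon as the stage it actually depends on has finished, rather than waiting for the whole instruction. The class and method names (Stage, Scheduler, on_stage_end, etc.) are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Stage:
    name: str
    depended_upon: bool          # does the subsequent instruction depend on this stage?

@dataclass
class Scheduler:
    # callbacks registered by the functional module that runs the subsequent instruction
    waiters: List[Callable[[str], None]] = field(default_factory=list)

    def on_stage_end(self, callback: Callable[[str], None]) -> None:
        self.waiters.append(callback)

    def run_current_instruction(self, stages: List[Stage]) -> None:
        for stage in stages:
            # ... the first functional module executes this stage of the current instruction ...
            if stage.depended_upon:
                # issue the stage end marker early instead of waiting for the
                # whole instruction to finish
                for notify in self.waiters:
                    notify(stage.name)

# Usage sketch: a data load instruction whose weight-loading stage is the only
# part the subsequent data operation instruction depends on.
sched = Scheduler()
sched.on_stage_end(lambda s: print(f"operation engine starts after stage '{s}'"))
sched.run_current_instruction([
    Stage("load weights", depended_upon=True),      # depended-upon phase -> marker issued here
    Stage("load remaining feature rows", depended_upon=False),
])
```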
For neural network inference, the operations involved are relatively simple (the types of layers in a neural network model are limited), the data volume is huge, and the degree of parallelism in each dimension can be set flexibly, so the granularity of the neural network operation instructions received by a neural network computing platform is large. Such coarse-grained instructions give the neural network model broad adaptability to different neural network computing platforms (such as neural-network-dedicated processors) and leave room for the computing platform to implement finer-grained operations.
FIG. 6 illustrates a schematic diagram of a computing system for performing neural network computations, according to one embodiment of the present invention. As shown, the neural network computing system 600 may include a number of functional modules 610, an internal cache 620, and a controller 630.
The plurality of functional modules 610 may be a plurality of functional modules that perform respective functions based on the acquired instructions for performing neural network computations. The internal cache 620 may cache data needed to perform neural network computations. The controller 630 is used to control the operations of the plurality of functional modules 610 and the internal cache 620. The thin arrows in the figure indicate the sending of control commands, and the thick arrows indicate the passing of data. "plurality" of the plurality of functional modules refers to two or more, and although three functional modules are shown, it should be understood that the computing system 600 may have more or fewer functional modules depending on the particular application.
The controller 630 may be configured to execute a current instruction using a first functional module, and may start execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction finishes execution based on at least parameter information of the current instruction and dependency information of the subsequent instruction. Here, the first and second functional modules may be any of the plurality of functional modules 610. "first" and "second" are used merely to distinguish between different modules and do not imply any order or importance to the modules. It should also be understood that as the instructions are executed, the roles among the plurality of functional modules may change, in other words, which functional module is the first functional module executing the current instruction and which functional module is the second functional module to execute the subsequent instruction may be determined according to the instruction currently being executed.
In one embodiment, controller 630 may be a control module that performs instruction fetching and dispatch. In that case, executing the current instruction using the first functional module can be understood as the controller 630 dispatching the current instruction to the first functional module, causing it to execute the instruction. The controller 630 being configured to start execution of the subsequent instruction by the second functional module before the current instruction has finished executing can be understood as the controller 630 receiving an instruction end indication issued early by the first functional module, before the current instruction has actually finished, and dispatching the subsequent instruction to the second functional module early so that its execution starts early.
The data read from the external memory by internal cache 620 may also typically include instruction data. Thus, in one embodiment, particularly where controller 630 is implemented as a control module that performs instruction fetching and dispatching, controller 630 may read instruction data from internal cache 620.
Explicit dependency information describing an instruction's dependencies on other instructions may be included in the instructions for performing neural network computation. This explicit dependency information may be produced, for example, at the instruction compilation stage by a specialized neural network compiler based on the input neural network algorithm model. When the controller 630 is a control module for instruction fetching and dispatch, the controller 630 obtains this explicit dependency information along with the compiled instructions and can use it as the dependency information of the subsequent instructions that directly depend on the current instruction.
In one embodiment, the controller 630 may be further configured to: divide execution of the current instruction into a depended-upon phase and an independent phase based at least on type information of the current instruction and the subsequent instruction; generate the current-instruction end marker directly once the depended-upon phase has completed; and execute the subsequent instruction using the second functional module based at least on the current-instruction end marker. When the controller 630 is a control module for instruction fetching and dispatch, this phase division may also be performed by a specialized neural network compiler at, for example, the instruction compilation stage; the first functional module can generate the current-instruction end marker directly after the depended-upon phase completes, and the fetch-and-dispatch control module 630 can dispatch the subsequent instruction to the second functional module for execution as soon as it receives that end marker.
In one embodiment, the controller 630 may be further configured to: divide execution of the current instruction into a plurality of stages based at least on the parameter information and the dependency information; generate a stage end marker when at least one of the plurality of stages has completed; and execute the subsequent instruction using the second functional module based at least on the stage end marker. Similarly, when the controller 630 is a control module for instruction fetching and dispatch, the stage division may also be performed by a specialized neural network compiler at, for example, the instruction compilation stage; the first functional module generates a stage end marker after a given stage completes, and the fetch-and-dispatch control module 630 can dispatch the subsequent instruction to the second functional module for execution as soon as it receives that stage end marker. The first functional module may then, for example, keep sending end indications as each stage completes, so that the second functional module can perform the corresponding fine-grained dependent operations.
In one embodiment, the plurality of functional modules 610 may be more specific acceleration engines. FIG. 7 illustrates a schematic diagram of a computing system for performing neural network computations, in accordance with an embodiment of the present invention. The neural network computing system 700 of fig. 7 also includes an internal cache 720 and a controller 730. Further, the functional modules of the computing system 700 may be a data loading engine 711, a data operation engine 712 and a data storage engine 713, respectively. Three engines share an internal cache 720 and data load engine 711 and data store engine 713 may interact with external memory 740 via a bus or other communication mechanism, for example.
The data load engine 711 may execute data load instructions that load data for neural network computation from external memory into the internal cache. The loaded data may include parameter data and feature map data. The parameter data may include weight data (e.g., convolution kernels) and other parameters such as biases. The feature map data may include input image data as well as the intermediate results of each convolutional layer. The data operation engine 712 may execute data operation instructions that read the weight data and feature map data from internal cache 720, perform the operations, and store the results back in internal cache 720. The data storage engine 713 may then execute data store instructions that store the operation results from internal cache 720 back to external memory 740. It will be understood that the data load engine 711, the data operation engine 712, and the data storage engine 713 carry out their respective instruction functions under the scheduling of the internal controller 730.
Accordingly, the acquired instructions for neural network computation may include: data load instructions for loading data for neural network computation from external memory into the internal cache, the data comprising parameter data and feature map data; data operation instructions for reading the parameter data and feature map data from the internal cache, performing the operation, and storing the operation result back to the internal cache; and data store instructions for storing the operation result from the internal cache back to external memory.
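A minimal sketch of how these three instruction classes and the explicit dependency information they may carry could be represented; the field names are hypothetical, chosen only to make the load/compute/store split and the direct-dependency links concrete.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List

class InstrType(Enum):
    LOAD = auto()    # external memory -> internal cache (parameters, feature maps)
    CALC = auto()    # read from cache, compute, write result back to cache
    SAVE = auto()    # internal cache -> external memory

@dataclass
class Instruction:
    instr_id: int
    itype: InstrType
    params: dict                                          # e.g. shapes, addresses, parallelism
    depends_on: List[int] = field(default_factory=list)   # explicit dependency information

# A typical sequence for one layer: load weights/features, compute, store results.
program = [
    Instruction(0, InstrType.LOAD, {"bytes": 1 << 20}),
    Instruction(1, InstrType.CALC, {"macs": 1 << 27}, depends_on=[0]),
    Instruction(2, InstrType.SAVE, {"bytes": 1 << 18}, depends_on=[1]),
]
```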
Fig. 8A and 8B show how instructions with dependencies are executed in the prior art and in the present invention, respectively. As shown in fig. 8A, in the prior art the preceding functional module usually has to finish the current instruction before execution of a subsequent instruction that depends on its result can begin. For example, while the data load engine is loading data for the current data load instruction, the data operation engine can start executing the data operation instruction on the loaded data only after it has received the indication from the data load engine that the current instruction has completed.
In a computing system that applies the instruction scheduling principle of the present invention, such as a neural-network-dedicated processor, another engine can begin executing the subsequent instruction before the current instruction has finished executing, as shown in fig. 8B. Instructions that originally had a dependency relationship thus partially overlap in time, improving the overall computational efficiency of the computing system.
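A small back-of-the-envelope sketch of the effect shown in figs. 8A and 8B, using purely hypothetical cycle counts: if loading takes 100 cycles, computing 80 cycles, and the compute can start once the first 30 cycles' worth of loaded data (one complete operation unit) is available, the overlapped schedule finishes earlier than the serial one.

```python
load_cycles = 100          # hypothetical duration of the data load instruction
calc_cycles = 80           # hypothetical duration of the dependent data operation instruction
first_unit_ready = 30      # cycles until one complete operation unit has been loaded

serial_total = load_cycles + calc_cycles   # fig. 8A style: 180 cycles

# fig. 8B style: compute starts at cycle 30, assuming it never stalls
# waiting for data that is still being loaded.
overlapped_total = max(load_cycles, first_unit_ready + calc_cycles)  # 110 cycles

print(serial_total, overlapped_total)
```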
Returning to FIG. 5, in step S530 the execution of the current instruction may be divided into a plurality of stages based on the parameter information and the dependency information; a stage end marker is generated when at least one of the stages has completed; and the subsequent instruction is executed using the second functional module based at least on the stage end marker. The granularity of the stage division is determined based at least on the granularity of the instructions for performing the neural network computation and the parameters of the computing system that performs it. In other words, an internal controller of the computing system (e.g., controller 630 or 730) can determine the granularity of instruction scheduling inside the computing system from the granularity of the instructions obtained from the neural network model algorithm and from the parameters of the computing system itself, and can send fine-grained end markers to the functional module that executes the subsequent instruction, so that this module can start executing before receiving the indication that the previous instruction has completely finished.
In one embodiment, step S520 may include executing a current data load instruction using the data load engine, and step S530 may include starting execution of the data operation instruction using the data operation engine in response to the weights and feature map data for at least one complete operation unit having been loaded, before the indication that the current data load instruction has finished executing is received. Here, the first functional module is the data load engine and the second functional module is the data operation engine.
In a specific neural network acceleration application, an appropriate parallelization strategy is selected according to the actual situation for performing the convolution computation shown in fig. 2. The parallel computation may be performed along any one or more of the channel (C), width-and-height (W × H), and batch (B) dimensions. Based on the order in which the data load engine loads data for the data load instruction and on the parallel computation scheme adopted by the data operation engine for the subsequent data operation instruction, the instruction scheduling strategy of the present invention can determine the fine-grained correspondence between the current data load instruction and the subsequent data operation instruction and carry out the corresponding computation accordingly.
For example, when a weight-stationary architecture is adopted, the data load module first loads the weights and then loads the feature values row by row; once the data required by one complete operation unit has been loaded (or once all the data required by the operation units that compute in parallel within one clock cycle has been loaded), the data operation engine can read the corresponding data from the internal cache and start computing. The computing system can choose a suitable granularity for the current data load instruction and the subsequent data operation instruction according to the weight-stationary architecture, the granularity of the preceding data load instruction, and the degree of data-operation parallelism in the data operation engine, and carry out the convolution (multiply-accumulate) operations corresponding to the row-by-row loading of feature values granule by granule.
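A sketch, under hypothetical names and sizes, of the weight-stationary handoff just described: the load engine signals each time the data for one complete operation unit is in the cache, and the operation engine consumes those units without waiting for the whole load instruction to finish.

```python
from queue import Queue
from threading import Thread

UNITS = 4                  # hypothetical number of complete operation units in this load instruction
ready_units = Queue()      # end markers for fine-grained, per-unit completion

def data_load_engine():
    # load weights first, then feature rows; after each complete operation
    # unit's data is in the internal cache, issue a fine-grained end marker
    for unit in range(UNITS):
        # ... transfer weights + feature rows for this unit into the cache ...
        ready_units.put(unit)
    ready_units.put(None)  # the whole data load instruction has now finished

def data_operation_engine():
    # start computing as soon as the first unit is ready, long before the
    # load instruction's own end indication arrives
    while (unit := ready_units.get()) is not None:
        print(f"convolving operation unit {unit}")

t = Thread(target=data_load_engine)
t.start()
data_operation_engine()
t.join()
```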
Correspondingly, when a feature-stationary (feature map stationary) architecture is adopted, the data load module first loads the feature values required for the computation and then loads the convolution kernels one by one; again, once the data required by one complete operation unit has been loaded (or once the data required by the operation units computing in parallel within one clock cycle has been loaded), the data operation engine can read the corresponding data from the internal cache and start computing. The computing system can choose a suitable granularity for the current data load instruction and the subsequent data operation instruction according to the feature-stationary architecture, the granularity of the preceding data load instruction, and the degree of data-operation parallelism in the data operation engine, and carry out the convolution (multiply-accumulate) operations corresponding to the one-by-one loading of convolution kernels kernel by kernel.
Other data-reuse strategies (such as a row-stationary architecture) may also be adopted for data loading, or neural network computation instructions that adopt other data-reuse strategies may be obtained. Whatever strategy is used, the neural network computing system can derive a reasonable instruction scheduling scheme from the reuse information, its own architectural information, and the dependency relationships of the instructions, so that partial parallel processing of dependent subsequent instructions is realized more reasonably and efficiently.
In one embodiment, step S520 may include executing a current data operation instruction using the data operation engine, and step S530 may include, before the indication that the current data operation instruction has finished executing is received, in response to at least one final operation result being produced, caching that result in the internal cache and starting execution of the data store instruction using the data storage engine to store it from the internal cache back to external memory. Here, the first functional module is the data operation engine and the second functional module is the data storage engine.
When a data operation instruction is followed by a data store instruction, the controller can, according to the parameters of the specific operation instruction, issue a corresponding computation end marker to the data storage engine for each batch of output feature map results as they are produced (i.e., results that are to be stored back to external memory, as opposed to intermediate results that the current data operation instruction will reuse). The data storage engine can then store the output feature map back to external memory at the corresponding granularity, following the end markers issued one by one.
In one embodiment, step S520 may include executing a current data store instruction using the data storage engine, and step S530 may include, when the output feature map data being stored back to external memory by the current data store instruction has no dependency on the input feature map data to be loaded from external memory by a data load instruction that directly depends on the current data store instruction, starting execution of the data load instruction using the data load module as soon as the output feature map data has been written to a bus buffer. Here, the first functional module is the data storage engine and the second functional module is the data load engine.
In most cases, the data to be loaded next by the neural network has no dependency on the data currently being stored; that is, the load can proceed without waiting for the store to complete. In this case, the data store instruction can be treated as finished without waiting for a response from the bus or the device. In embodiments where data access goes through a bus, the bus is occupied by the output data, so the load data can only be read after that occupation ends. However, the load does not actually need to wait until the output data has been written back to external memory. Therefore, once the output feature map data has been written into the bus buffer, the controller can issue a corresponding end indication (i.e., an internal processor indication that the output data has released the bus, rather than an indication that the data store instruction has ended) and thereby start execution of the subsequent data load instruction, for example segment by segment at the chosen granularity.
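A brief sketch of the condition just described, with hypothetical flags: the subsequent data load instruction may begin as soon as the output feature map has been written into the bus buffer, provided it has no data dependency on that output; otherwise it must wait for the store to complete.

```python
def may_start_load(load_depends_on_stored_output: bool,
                   output_in_bus_buffer: bool,
                   store_completed: bool) -> bool:
    # No dependency: the bus is free again once the output data sits in the
    # bus buffer, so loading may begin without waiting for the write-back
    # to external memory to finish.
    if not load_depends_on_stored_output:
        return output_in_bus_buffer
    # Dependency on the stored output: wait for the data store instruction
    # to actually complete.
    return store_completed

print(may_start_load(False, True, False))   # True  - load starts early
print(may_start_load(True, True, False))    # False - must wait for the store
```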
The instruction scheduling scheme of the present invention can also be applied so that finer-grained operations within an instruction overlap one another. For example, the batch normalization operation in a neural network model can typically be performed at data-loading time. If a data load is written as Ld and a parameter load as Lw, the actual load instructions may be Lw0, Lw1, ..., Lwn, Ld0, Ld1, ..., Ldm. Because there are no dependencies among the parameters, when executing Lw0-Lwn a later instruction does not have to wait for the preceding one to actually finish, and the situation is similar when executing Ld0-Ldm. In other words, the executions of Lw0-Lwn may overlap one another, as may those of Ld0-Ldm. However, Ld0-Ldm may only be executed after Lwn has finished, which guarantees that all parameters are ready when the data are loaded, thereby enabling, for example, computation-while-loading for the batch normalization (BN) layer.
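A small sketch of the overlap rule in the batch-normalization loading example above: parameter loads Lw0…Lwn may overlap one another, data loads Ld0…Ldm may overlap one another, but the first data load must wait until the last parameter load Lwn has finished. The representation is hypothetical.

```python
def may_issue(instr: str, completed: set[str], n: int) -> bool:
    """instr is e.g. 'Lw3' or 'Ld0'; n is the index of the last parameter load Lwn."""
    if instr.startswith("Lw"):
        return True                         # parameter loads have no mutual dependency
    if instr.startswith("Ld"):
        return f"Lw{n}" in completed        # data loads must wait for all parameters
    return False

completed = {f"Lw{i}" for i in range(3)}    # Lw0..Lw2 done, Lw3 (= Lwn) still running
print(may_issue("Ld0", completed, n=3))     # False: Lwn not finished yet
completed.add("Lw3")
print(may_issue("Ld0", completed, n=3))     # True: all parameters ready
```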
In a dedicated neural network computing platform, high-parallelism computation is usually carried out by heterogeneous circuits such as a GPU, an FPGA, or an ASIC, so the time required for data computation is short compared with data access to the external memory over a bus. In addition, neural network algorithms involve relatively uniform computation types and an extremely large amount of computation. In view of this, the instruction scheduling scheme of the present invention divides dependency-blocked instructions into smaller granularities, improving the efficiency with which data is supplied to the data operation engine and making fuller use of the bus for data access, thereby improving the efficiency of the overall neural network computing system when performing neural network computation.
It should be understood that although certain interdependencies exist between the different classes of instructions executed by the respective functional modules in the present invention, the instruction classes themselves may execute in parallel. In other words, each functional module can execute its own class of instructions in parallel while dependent instructions are executed in the proper order under the instruction scheduling scheme of the present invention. Furthermore, multiple current instructions and multiple first functional modules executing them may exist at the same time, as may multiple subsequent instructions and multiple second functional modules executing them, so that multi-module, multi-dependency parallel execution is realized.
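As a purely sequential simulation under assumed data structures (the Instr records, engine names, and marker names are hypothetical and not taken from the described embodiments), the following sketch shows a controller issuing instructions of different classes to their respective engines as soon as the markers they wait for have been published.

    from dataclasses import dataclass, field

    @dataclass
    class Instr:
        name: str
        engine: str                                   # "load", "compute", or "store"
        waits_for: set = field(default_factory=set)   # markers required before issue
        produces: set = field(default_factory=set)    # markers raised once the
                                                      # dependency-relevant work is done

    program = [
        Instr("Ld0", "load",    set(),         {"ifm_ready"}),
        Instr("Cv0", "compute", {"ifm_ready"}, {"ofm_ready"}),
        Instr("St0", "store",   {"ofm_ready"}, {"bus_free"}),
        Instr("Ld1", "load",    {"bus_free"},  set()),   # waits only for the bus, not the full store
    ]

    markers, pending, running = set(), list(program), {}
    while pending or running:
        # issue every pending instruction whose markers are satisfied and whose engine is idle
        for instr in list(pending):
            if instr.waits_for <= markers and instr.engine not in running:
                pending.remove(instr)
                running[instr.engine] = instr
                print("issue", instr.name, "on", instr.engine)
        # model the engines finishing their current instructions and publishing their markers
        for engine, instr in list(running.items()):
            markers |= instr.produces
            del running[engine]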
In one embodiment, the neural network computing system of the present invention may be implemented in a system on a chip (SoC) that includes a general purpose processor, memory, and digital circuitry. Figure 9 shows an example of a SoC that can be used to implement the neural network computations involved in the present invention.
In one embodiment, the deep learning network required by the present system, such as a convolutional neural network, may be implemented by the digital circuit portion (e.g., an FPGA) of the SoC. For example, a dedicated neural network processor implemented with a GPU, FPGA, or ASIC may carry out the instruction scheduling scheme according to the present invention. Since neural network models perform highly parallel computation, implementing the neural network computation function in logic hardware, in particular a GPU, FPGA, or ASIC, is naturally advantageous and can achieve lower power consumption than a software implementation.
In one embodiment, all the neural network parameters obtained from prior training may be stored in a memory of the system on chip (e.g., the main memory, corresponding to the external memory of FIGS. 6 and 7). When neural network inference (e.g., object detection) is later performed, the parameters of each layer of the neural network are first read from the main memory, and the computation is then carried out by the programmable logic module shown in FIG. 9. It should be understood that architectures other than the programmable logic module shown in FIG. 9 may also be used to implement the neural network computing system of the present invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While embodiments of the present invention have been described above, the foregoing description is illustrative, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (19)

1. An instruction scheduling method for performing neural network computations, comprising:
obtaining instructions for performing neural network computations;
executing the current instruction by using the first functional module; and
starting execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction finishes execution based on at least parameter information of the current instruction and dependency information of the subsequent instruction,
wherein starting execution of a subsequent instruction using a second functional module before completion of execution of the current instruction based on at least parameter information of the current instruction and dependency information of the subsequent instruction directly dependent on the current instruction comprises:
dividing execution of the current instruction into a dependent phase and an independent phase based at least on type information of the current instruction and the subsequent instruction;
directly generating a current instruction end marker once the dependent phase has completed; and
executing the subsequent instruction using the second functional module based at least on the current instruction end marker.
2. The method of claim 1, wherein initiating execution of a subsequent instruction that directly depends on the current instruction using a second functional module before the current instruction finishes execution based at least on parameter information of the current instruction and dependency information of the subsequent instruction comprises:
dividing execution of the current instruction into a plurality of stages based at least on the parameter information and the dependency information;
generating a stage end marker when at least one of the plurality of stages has completed; and
executing the subsequent instruction using a second functional module based at least on the stage end marker.
3. The method of claim 2, wherein executing the subsequent instruction using a second functional module based at least on the stage end marker indicating that at least one of the plurality of stages has completed comprises:
executing the subsequent instruction using the second functional module based on the data obtained from the completed at least one stage.
4. The method of claim 2, wherein a granularity of the division into the plurality of stages is determined based at least on a granularity of the instructions for performing neural network computations and on parameters of a computing system used to perform the neural network computations.
5. The method of claim 1, wherein the instructions obtained to perform neural network computations comprise:
a data load instruction for loading data for neural network computation from an external memory to an internal cache, the data for neural network computation including parameter data and feature map data;
a data operation instruction for reading the parameter data and the feature map data from the internal cache, performing an operation, and storing an operation result back to the internal cache; and
a data storage instruction for storing the operation result from the internal cache back to the external memory.
6. The method of claim 5, wherein executing the current instruction using the first functional module comprises:
executing the current data load instruction using the data load engine; and
starting execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction completes execution based on at least parameter information of the current instruction and dependency information of the subsequent instruction, comprising:
before acquiring indication information that the current data load instruction has finished executing, in response to completion of the loading of the weight and feature map data for at least one complete operation unit, starting execution of the data operation instruction using a data operation engine.
7. The method of claim 5, wherein executing the current instruction using the first functional module comprises:
executing the current data operation instruction by using a data operation engine; and
starting execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction finishes execution based on at least parameter information of the current instruction and dependency information of the subsequent instruction, comprising:
before acquiring indication information that the current data operation instruction has finished executing, in response to generation of at least one final operation result, caching the at least one final operation result in the internal cache, and starting execution of the data storage instruction using a data storage engine to store the at least one final operation result from the internal cache back to the external memory.
8. The method of claim 5, wherein executing the current instruction using the first functional module comprises:
executing the current data storage instruction using the data storage engine; and
starting execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction completes execution based on at least parameter information of the current instruction and dependency information of the subsequent instruction, comprising:
in response to an absence of a dependency between output feature map data stored back to the external memory by the current data storage instruction and input feature map data to be loaded from the external memory by a data load instruction directly dependent on the current data storage instruction, initiating execution of the data load instruction using a data load engine after the output feature map data is written to a bus buffer.
9. The method of claim 1, wherein the obtained instructions include explicit dependency information indicating each instruction's dependencies on other instructions, and the explicit dependency information in the current instruction is used as the dependency information of the subsequent instruction that directly depends on the current instruction.
10. A neural network computing system, comprising:
a plurality of functional modules that perform respective functions based on instructions for performing neural network computations;
an internal cache for caching data required for performing neural network computations; and
a controller configured to:
execute the current instruction using the first functional module; and
start execution of a subsequent instruction directly dependent on the current instruction using a second functional module before the current instruction finishes execution, based at least on parameter information of the current instruction and dependency information of the subsequent instruction,
wherein the controller is further configured to:
divide execution of the current instruction into a dependent phase and an independent phase based at least on type information of the current instruction and the subsequent instruction;
directly generate a current instruction end marker when the dependent phase is completed; and
execute the subsequent instruction using the second functional module based at least on the current instruction end marker.
11. The computing system of claim 10, wherein the controller is further configured to:
divide execution of the current instruction into a plurality of stages based at least on the parameter information and the dependency information;
generate a stage end marker when at least one of the plurality of stages has completed; and
execute the subsequent instruction using a second functional module based at least on the stage end marker.
12. The computing system of claim 11, wherein the controller is further configured to:
execute the subsequent instruction using the second functional module based on the data obtained from the completed at least one stage.
13. The computing system of claim 11, wherein a granularity of the division into the plurality of stages is determined by the controller based at least on a granularity of the instructions for performing neural network computations and on parameters of the computing system for performing the neural network computations.
14. The computing system of claim 10, wherein the plurality of functional modules comprise:
a data load engine to execute a data load instruction to load data for neural network computations from an external memory to an internal cache, the data for neural network computations including parameter data and feature map data;
a data operation engine to execute a data operation instruction to read the parameter data and the feature map data from the internal cache, perform an operation, and store an operation result back to the internal cache; and
a data storage engine to execute a data storage instruction to store the operation result from the internal cache back to the external memory.
15. The computing system of claim 14, wherein the first functional module is a data load engine, the second functional module is a data operation engine, and the data operation engine initiates execution of the data operation instruction in response to the data load engine completing the loading of the weight and feature map data for at least one complete operation unit.
16. The computing system of claim 14, wherein the first functional module is a data operation engine, the second functional module is a data storage engine, and in response to the data operation engine generating at least one final operation result and caching the at least one final operation result in the internal cache, the data storage engine initiates execution of the data storage instruction to store the at least one final operation result from the internal cache back to the external memory.
17. The computing system of claim 14, wherein the first functional module is a data storage engine, the second functional module is a data load engine, and in response to an absence of a dependency between the output feature map data stored back to the external memory by the data storage engine and the input feature map data to be loaded from the external memory by a data load instruction directly dependent on the current data storage instruction, execution of the data load instruction begins using the data load engine after the output feature map data is written to a bus buffer.
18. The computing system of claim 10, wherein the instructions for performing neural network computations include explicit dependency information indicating each instruction's dependencies on other instructions, and the controller uses the explicit dependency information in the current instruction as the dependency information of the subsequent instruction that directly depends on the current instruction.
19. The computing system of claim 10, wherein the computing system is implemented at least in part by a GPU, FPGA, or ASIC.
CN201810690479.4A 2018-06-28 2018-06-28 Instruction scheduling method for performing neural network computation and corresponding computing system Active CN110659069B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810690479.4A CN110659069B (en) 2018-06-28 2018-06-28 Instruction scheduling method for performing neural network computation and corresponding computing system
US16/454,103 US11093225B2 (en) 2018-06-28 2019-06-27 High parallelism computing system and instruction scheduling method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810690479.4A CN110659069B (en) 2018-06-28 2018-06-28 Instruction scheduling method for performing neural network computation and corresponding computing system

Publications (2)

Publication Number Publication Date
CN110659069A CN110659069A (en) 2020-01-07
CN110659069B true CN110659069B (en) 2022-08-19

Family

ID=69027469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810690479.4A Active CN110659069B (en) 2018-06-28 2018-06-28 Instruction scheduling method for performing neural network computation and corresponding computing system

Country Status (1)

Country Link
CN (1) CN110659069B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352896B (en) * 2020-03-03 2022-06-24 腾讯科技(深圳)有限公司 Artificial intelligence accelerator, equipment, chip and data processing method
CN111813721B (en) * 2020-07-15 2022-09-09 深圳鲲云信息科技有限公司 Neural network data processing method, device, equipment and storage medium
CN112348179B (en) * 2020-11-26 2023-04-07 湃方科技(天津)有限责任公司 Efficient convolutional neural network operation instruction set architecture construction method and device, and server
CN112559054B (en) * 2020-12-22 2022-02-01 上海壁仞智能科技有限公司 Method and computing system for synchronizing instructions
CN112860320A (en) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 Method, system, device and medium for data processing based on RISC-V instruction set

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9766894B2 (en) * 2014-02-06 2017-09-19 Optimum Semiconductor Technologies, Inc. Method and apparatus for enabling a processor to generate pipeline control signals
CN104699464B (en) * 2015-03-26 2017-12-26 中国人民解放军国防科学技术大学 A kind of instruction level parallelism dispatching method based on dependence grid
US10089178B2 (en) * 2016-02-29 2018-10-02 International Business Machines Corporation Developing an accurate dispersed storage network memory performance model through training

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016540A (en) * 1997-01-08 2000-01-18 Intel Corporation Method and apparatus for scheduling instructions in waves
CN103646009A (en) * 2006-04-12 2014-03-19 索夫特机械公司 Apparatus and method for processing an instruction matrix specifying parallel and dependent operations
CN105117286B (en) * 2015-09-22 2018-06-12 北京大学 The dispatching method of task and streamlined perform method in MapReduce
CN106227507A (en) * 2016-07-11 2016-12-14 姚颂 Calculating system and controller thereof
CN107886167A (en) * 2016-09-29 2018-04-06 北京中科寒武纪科技有限公司 Neural network computing device and method

Also Published As

Publication number Publication date
CN110659069A (en) 2020-01-07

Similar Documents

Publication Publication Date Title
CN110659069B (en) Instruction scheduling method for performing neural network computation and corresponding computing system
US11093225B2 (en) High parallelism computing system and instruction scheduling method thereof
CN110766147B (en) Neural network compiler architecture and compiling method
EP3685319B1 (en) Direct access, hardware acceleration in neural network
CN110764744B (en) Intermediate representation generation method and device for neural network calculation
CN110689115B (en) Neural network model processing method and device, computer equipment and storage medium
US20200249998A1 (en) Scheduling computation graph heterogeneous computer system
CN111104120B (en) Neural network compiling method and system and corresponding heterogeneous computing platform
CN110659070B (en) High-parallelism computing system and instruction scheduling method thereof
TW202026858A (en) Exploiting activation sparsity in deep neural networks
US11556756B2 (en) Computation graph mapping in heterogeneous computer system
CN114450699A (en) Method implemented by a processing unit, readable storage medium and processing unit
US11921814B2 (en) Method and device for matrix multiplication optimization using vector registers
US11354360B2 (en) Method and apparatus for compiling computation graphs into an integrated circuit
WO2020247314A1 (en) Reducing computation in neural networks using selfmodifying code
CN114556260A (en) Apparatus and system for performing neural networks
EP4128065A1 (en) Feature reordering based on similarity for improved memory compression transfers during machine learning jobs
CN112732638B (en) Heterogeneous acceleration system and method based on CTPN network
CN114219091A (en) Network model reasoning acceleration method, device, equipment and storage medium
CN111274023B (en) Data processing method, device, computer system and storage medium
CN113902088A (en) Method, device and system for searching neural network structure
WO2023004670A1 (en) Channel-guided nested loop transformation and scalar replacement
CN110765413A (en) Matrix summation structure and neural network computing platform
US20240037150A1 (en) Scheduling optimization in sequence space
Shaydyuk Layer-type Specialized Processing Engines for a Semi-Streaming Convolutional Neural Network Hardware Architecture for FPGAs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant