CN113469337B - Compiling method for optimizing neural network model and related products thereof - Google Patents


Info

Publication number
CN113469337B
Authority
CN
China
Prior art keywords
parameters
pooling
convolution
fusion
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110729713.1A
Other languages
Chinese (zh)
Other versions
CN113469337A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110729713.1A
Publication of CN113469337A
Application granted
Publication of CN113469337B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817: Specially adapted for signal processing, e.g. Harvard architectures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Signal Processing (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a compiling method, a compiler, an apparatus, a computing device, and a board card for optimizing a neural network model. The computing device is included in a combined processing device, which may further include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly perform a user-specified computing operation. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The scheme of the disclosure can significantly improve the computing performance of an intelligent computing system that includes an artificial intelligence processor.

Description

Compiling method for optimizing neural network model and related products thereof
Technical Field
The present disclosure relates generally to the field of artificial intelligence. More particularly, the present disclosure relates to a compiling method for optimizing a neural network model, to a compiler, an apparatus, and a computer program product for performing the foregoing compiling method, to an integrated circuit device comprising the foregoing compiler or apparatus, and to a board card comprising the integrated circuit device.
Background
In recent years, with data becoming easier to acquire and hardware computing power improving dramatically, deep learning has developed rapidly and its algorithms have been widely applied across industries. Nevertheless, as the images input to neural networks grow larger year by year and network parameters also multiply, computing power remains a bottleneck that hinders the development and application of algorithms for networks with massive parameters. Therefore, how to increase hardware utilization and improve the operating efficiency of the network is becoming an optimization focus for many algorithm providers.
In neural networks, including deep learning networks, the computational load is typically concentrated in convolution ("Conv") operations, and an increase in the input to a convolution operation typically results in an exponential increase in the amount of computation. To reduce the number of parameters of the network, the features extracted by the network are typically further condensed by an average pooling ("AvgPooling") operation. Thus, the Conv+AvgPooling structure often occurs in neural networks. However, this structure has a number of drawbacks. First, since Conv already contains a summation process and AvgPooling is also a summation in nature, the Conv+AvgPooling structure performs redundant addition operations, wasting computing power. Second, the output of the Conv calculation needs to be stored in an additional location. Since the Conv output is typically several times as large as the AvgPooling output, the existing structure cannot fully utilize the available memory resources and increases the I/O bandwidth pressure, compromising both output and computational efficiency. Furthermore, since the accumulation in the Conv+AvgPooling structure is performed in two passes, once in Conv and once in AvgPooling, it may introduce the floating-point precision problem of "large numbers eating small numbers": during data accumulation, a number with a smaller absolute value is "eaten" by a number with a larger absolute value due to the limited floating-point precision, which introduces a certain error.
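The precision issue described above is easy to reproduce. The following small numpy sketch (illustrative values only, not taken from the patent) shows a large single-precision accumulator absorbing small addends without changing, which is exactly the risk of splitting one long summation across Conv and AvgPooling:
    import numpy as np

    # float32 has a 24-bit mantissa, so 2**24 = 16777216 is where the next
    # integer is no longer representable and a small addend is lost entirely.
    big = np.float32(16777216.0)
    small = np.float32(1.0)
    print(big + small == big)       # True: the small value is "eaten"

    # Accumulating many small values after a large one compounds the error.
    acc = np.float32(16777216.0)
    for _ in range(1000):
        acc += np.float32(1.0)
    print(acc)                      # still 16777216.0 instead of 16778216.0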
Disclosure of Invention
In view of the technical problems mentioned in the background section above, the present disclosure proposes a solution for optimizing a neural network model that includes the above Conv+AvgPooling structure. With the scheme of the disclosure, the convolution layer and the pooling layer in the neural network can be fused into what this disclosure calls a fused convolution layer. Thus, the amount of computation in the existing Conv+AvgPooling structure can be reduced, and the computing performance of an intelligent computing system including an artificial intelligence processor can be significantly improved. To this end, the present disclosure provides schemes for optimizing neural network models in the following aspects.
In a first aspect, the present disclosure provides a compiling method for optimizing a neural network model, wherein the neural network model comprises a convolutional layer and a pooling layer connected to each other, the compiling method being performed by a general-purpose processor and comprising: acquiring convolution parameters and weights of the convolutional layer and pooling parameters of the pooling layer; fusing the convolution parameters and the pooling parameters to obtain fusion parameters; optimizing the neural network model according to the fusion parameters and the pooling parameters to convert the convolutional layer and the pooling layer into a fused convolutional layer, wherein the fused weights of the fused convolutional layer are obtained by converting the weights of the convolutional layer using the fusion parameters and the pooling parameters; and compiling the optimized neural network model into a corresponding binary instruction sequence for distribution to an artificial intelligence processor to execute a corresponding task.
In a second aspect, the present disclosure provides a compiler for optimizing a neural network model, wherein the neural network model includes a convolutional layer and a pooling layer connected to each other, the compiler comprising: an acquisition module for acquiring the convolution parameters and weights of the convolutional layer and the pooling parameters of the pooling layer; a fusion module for fusing the convolution parameters and the pooling parameters to obtain fusion parameters; an optimization module for optimizing the neural network model according to the fusion parameters and the pooling parameters so as to convert the convolutional layer and the pooling layer into a fused convolutional layer, wherein the fused weights of the fused convolutional layer are obtained by converting the weights of the convolutional layer using the fusion parameters and the pooling parameters; and a distribution module for compiling the optimized neural network model into a corresponding binary instruction sequence for distribution to the artificial intelligence processor to execute the corresponding task.
In a third aspect, the present disclosure provides an apparatus for optimizing a neural network model, comprising: at least one processor; and at least one memory for storing program instructions that, when loaded and executed by the at least one processor, cause the apparatus to perform the methods described above and in the various embodiments below.
In a fourth aspect, the present disclosure provides a computer program product comprising program instructions which, when executed by a processor, implement the method as described above and in the various embodiments below.
In a fifth aspect, the present disclosure provides a computing device comprising an artificial intelligence processor configured to execute a binary instruction sequence compiled according to a compilation method as described above and in embodiments below.
In a sixth aspect, the present disclosure provides a board card comprising a computing device as described above and in various embodiments below.
With the fusion scheme provided in the above aspects of the present disclosure, existing convolution and pooling operations can be optimized to a great extent. In particular, the fusion operation reduces the overall amount of computation in performing tasks associated with the neural network model. Further, by fusing the convolutional layer and the pooling layer, the I/O overhead caused by, for example, the output of the existing convolutional layer can be eliminated and the utilization efficiency of the hardware can be improved. Meanwhile, because the convolutional layer and the pooling layer are fused, the error caused by accumulating in two separate operations is reduced, thereby improving the accuracy and precision of the neural network model's operation.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the internal structure of a single core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating the internal architecture of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating the internal architecture of a processor core according to an embodiment of the present disclosure;
FIG. 6 is a simplified block diagram illustrating particular layers of a neural network model to which the disclosed solution relates;
FIG. 7 is a schematic block diagram illustrating convolutional layer operation in a neural network model, according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram illustrating the operation of a pooling layer in a neural network model, according to an embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating a compilation method for optimizing a neural network model, according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram illustrating a weight conversion operation according to an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram illustrating a compiler according to an embodiment of the present disclosure; and
FIG. 12 is an operational block diagram illustrating an artificial intelligence computing system according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by a person skilled in the art without inventive effort fall within the scope of protection of this disclosure.
It should be understood that the terms "first," "second," and "third," and the like, as may be used in the claims, specification, and drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of this disclosure are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in this disclosure and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. It will be understood that the structure and composition shown in fig. 1 are merely an example and are not intended to limit aspects of the present disclosure in any way.
As shown in fig. 1, the board 10 includes a chip 101, which may be a System on Chip (SoC), i.e., a system on chip as described in the context of the present disclosure. In one implementation scenario, the chip may integrate one or more combined processing devices. The combined processing device may be an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining; deep learning technology in particular is widely applied in the cloud intelligence field. One notable characteristic of cloud intelligence applications is the large volume of input data, which places very high requirements on the storage capacity and computing capacity of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and strong computing capability.
As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface means 102. The external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like, according to different application scenarios. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 may also include a memory device 104 for storing data, including one or more memory cells 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 may be configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a single chip microcomputer (Micro Controller Unit, MCU).
Fig. 2 is a block diagram showing the combined processing device in the chip 101 according to the above-described embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (Dynamic Random Access Memory, DRAM) 204.
The computing device 201 may be configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor. In some operations, it may be used to perform calculations in terms of deep learning or machine learning, and may also interact with the processing device 203 through the interface device 202 to collectively accomplish user-specified operations. In aspects of the present disclosure, the computing device may be configured to perform various tasks of the optimized neural network model, such as performing fused convolution operations using fusion parameters obtained after fusion of the convolutional layer and the pooling layer, which will be described later in the present disclosure.
The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or in addition, the interface device 202 may also read data from the storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are regarded as forming a heterogeneous multi-core structure. According to aspects of the present disclosure, when the processing device 203 is implemented as a general-purpose processor, it may perform the compiling operation for optimizing the neural network model, in order to compile the neural network model into a binary instruction sequence executable by the computing device.
The DRAM 204 is used to store data to be processed. It is a Double Data Rate (DDR) memory, typically 16 GB or more in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal architecture of the computing device 201 as a single core. The single-core computing device 301 is used to process input data in fields such as computer vision, speech, natural language processing, and data mining, and comprises three major modules: a control module 31, an operation module 32, and a storage module 33.
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete the task of deep learning, and comprises a fetch unit (Instruction Fetch Unit, IFU) 311 and an instruction decode unit (Instruction Decode Unit, IDU) 312. The instruction fetching unit 311 is configured to fetch an instruction from the processing device 203, and the instruction decoding unit 312 decodes the fetched instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 322 performs the core computations of deep learning algorithms, i.e., matrix multiplication and convolution. The storage module 33 is used to store or transfer related data and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access module (Direct Memory Access, DMA) 333. The NRAM 331 stores input neurons, output neurons, and intermediate results of the computation; the WRAM 332 stores the convolution kernels, i.e., the weights, of the deep learning network; and the DMA 333 is connected to the DRAM 204 through the bus 34 for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal architecture of the computing device 201 as a multi-core. The multi-core computing device 41 adopts a hierarchical design; it is a system on chip according to the present disclosure that includes at least one cluster, each cluster in turn including a plurality of processor cores. In other words, the multi-core computing device 41 is structured in a system-on-chip / cluster / processor-core hierarchy. At the system-on-chip level, as shown in FIG. 4, the multi-core computing device 41 includes an external memory controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be a plurality (2 are shown for example) of external memory controllers 401 for accessing external memory devices, i.e., off-chip memory in the context of the present disclosure (e.g., DRAM 204 in fig. 2), in response to access requests issued by processor cores, to read data from or write data to off-chip. The peripheral communication module 402 is configured to receive a control signal from the processing device 203 through the interface device 202, and activate the computing device 201 to perform a task. The on-chip interconnect module 403 connects the external memory controller 401, the peripheral communication module 402, and the plurality of clusters 405 for transmitting data and control signals between the respective modules. The synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC) for coordinating the working progress of each cluster to ensure synchronization of information. The plurality of clusters 405 of the present disclosure are the compute cores of the multi-core computing device 41. Although 4 clusters are illustratively shown in fig. 4, as hardware evolves, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In one application scenario, the cluster 405 may be used to efficiently perform a deep learning algorithm.
At the cluster level, as shown in fig. 4, each cluster 405 may include a plurality of processor cores (IPU cores) 406 and one memory core (MEM core) 407, which may include, for example, a cache memory (e.g., LLC) as described in the context of the present disclosure.
The number of processor cores 406 is illustratively shown as 4 in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and may likewise include three modules: a control module 51, an operation module 52, and a storage module 53. The functions and structures of the control module 51, the operation module 52, and the storage module 53 are substantially the same as those of the control module 31, the operation module 32, and the storage module 33, and are not described again here. It should be noted that the storage module 53 may include an input/output direct memory access module (Input/Output Direct Memory Access, IODMA) 533 and a move direct memory access module (Move Direct Memory Access, MVDMA) 534. The IODMA 533 controls memory access between the NRAM 531/WRAM 532 and the DRAM 204 over the broadcast bus 409; the MVDMA 534 controls memory access between the NRAM 531/WRAM 532 and the memory unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is primarily used for storage and communication, i.e., storing data shared among the processor cores 406 or intermediate results, and carrying out communication between the cluster 405 and the DRAM 204, among the clusters 405, among the processor cores 406, and so on. In other embodiments, the memory core 407 may have scalar operation capability to perform scalar operations.
The memory core 407 may include a static random-access memory (Static Random-Access Memory, SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 410, and a global direct memory access module (Global Direct Memory Access, GDMA) 411. In one implementation, the SRAM 408 may assume the role of a high-performance data relay station. Thus, data multiplexed among different processor cores 406 within the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but can instead be relayed among the processor cores 406 via the SRAM 408. Further, the memory core 407 only needs to distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406 quickly, so that inter-core communication efficiency can be improved and off-chip input/output accesses can be significantly reduced.
The broadcast bus 409, the CDMA 410, and the GDMA 411 are used to perform communication among the processor cores 406, communication among the clusters 405, and data transfer between the clusters 405 and the DRAM 204, respectively. Each is described below.
The broadcast bus 409 is used to facilitate high-speed communications among the processor cores 406 within the cluster 405. The broadcast bus 409 of this embodiment supports inter-core communications including unicast, multicast and broadcast. Unicast refers to the transmission of data from point to point (e.g., single processor core to single processor core), multicast is a communication scheme that transfers a piece of data from SRAM 408 to a specific number of processor cores 406, and broadcast is a communication scheme that transfers a piece of data from SRAM 408 to all processor cores 406, a special case of multicast.
The CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control access of the SRAM 408 of a cluster 405 to the DRAM 204 or to read data from the DRAM 204 into the SRAM 408. From the foregoing, it can be seen that communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be accomplished in two ways. The first way is for the DRAM 204 to communicate directly with the NRAM 431 or WRAM 432 through the IODMA 433; the second way is to transfer data between the DRAM 204 and the SRAM 408 via the GDMA 411, and then transfer data between the SRAM 408 and the NRAM 431 or WRAM 432 via the MVDMA 534. Although the second way may require more elements to participate and the data path is longer, in some embodiments the bandwidth of the second way is substantially greater than that of the first, so it may be more efficient to carry out communication between the DRAM 204 and the NRAM 431 or WRAM 432 in the second way. It will be appreciated that the data transmission schemes described here are merely exemplary, and that those skilled in the art may flexibly select and adapt various data transmission schemes in light of the teachings of the present disclosure and the specific arrangement of the hardware.
In other embodiments, the functionality of the GDMA 411 and the functionality of the IODMA 533 may be integrated into the same component. Although the GDMA 411 and the IODMA 533 are treated as different components for convenience of description, as long as the functions realized and the technical effects achieved are similar to those of the present disclosure, they fall within the scope of protection of the present disclosure. Further, the functions of the GDMA 411, the IODMA 533, the CDMA 410, and the MVDMA 534 may be implemented by the same component.
The hardware architecture of the present disclosure and its internal structure are described in detail above in connection with fig. 1-5. It is to be understood that the above description is intended to be illustrative and not restrictive. According to different application scenarios and hardware specifications, a person skilled in the art may also change the board card and its internal structure of the present disclosure, and these changes still fall within the protection scope of the present disclosure. The scheme for the system on chip of the present disclosure will be described in detail below.
Fig. 6 is a simplified block diagram illustrating particular layers of a neural network model 600 to which the present disclosure relates. As known to those skilled in the art, a neural network model generally includes an input layer, an output layer, and hidden layers between the input layer and the output layer. According to aspects of the present disclosure, the foregoing hidden layers may include the interconnected convolution layer 602 and pooling layer 603 shown in fig. 6. In one implementation scenario, the convolution layer may receive the output from a previous layer, i.e., the input 601 shown in FIG. 6, which may be referred to, for example, as an input feature map. By performing convolution operations (including multiply-add operations) on the input feature map with a particular filter (or convolution kernel), the convolution layer obtains an output feature map and feeds it to the pooling layer. Depending on the application scenario, the pooling layer may perform a pooling operation on the input data, such as the common average pooling or sum pooling, thereby reducing the size of the output 604 by reducing the size of its input (e.g., the aforementioned output feature map).
Fig. 7 is a schematic block diagram illustrating convolutional layer operation in a neural network model, according to an embodiment of the present disclosure. As shown, the convolution layer of the neural network model may perform a convolution process by applying a convolution kernel to the input feature map, thereby performing feature extraction to obtain an output feature map.
An input feature map of size 6 x 6 x 3 (i.e., input neuron data in the context of the present disclosure) is illustratively shown, which may represent 3 feature maps of size 6 x 6 (i.e., a three-dimensional tensor of 6 x 6 x 3), each representing a different feature. In this example, the width W of the input feature map is 6 and the height H is also 6. The number of input feature maps may also be referred to as the number of input channels Ci. For example, the example in the figure inputs 3 feature maps, also referred to as 3 feature channels.
Also illustrated in fig. 7 is a convolution kernel (or filter) of size 2 x 3 x 3 x 3, which may represent 2 stereo convolution kernels of size 3 x 3 x 3 (i.e., 2 three-dimensional tensors of size 3 x 3 x 3), each stereo convolution kernel in turn containing 3 different two-dimensional convolution kernels of size 3 x 3, corresponding to the 3 different two-dimensional feature maps of the input feature map. The number of stereo convolution kernels may be referred to as the number of output channels Co, which is 2 in this example. Within each stereo convolution kernel, the number of two-dimensional convolution kernels may be referred to as the number of input channels Ci, which corresponds to the number of channels of the input feature map. Each two-dimensional convolution kernel has a respective width Kw and height Kh, both of which are 3 in this example.
As further shown in the figure, the convolution of the input feature map with the convolution kernels outputs 2 two-dimensional feature maps of size 4 x 4. Here, the convolution of the input feature map with the lower stereo convolution kernel yields the lower 4 x 4 two-dimensional output feature map. The value at each position in the two-dimensional output feature map is obtained by convolving the corresponding block of each input feature map with the corresponding two-dimensional convolution kernel and then adding the results. For example, the figure shows that the value at position (0, 0) of the lower output feature map is obtained by convolving the block outlined by the black cube in the input feature map with the lower stereo convolution kernel to obtain 3 values, and then adding these values to obtain the final value. To obtain the output at other positions, the convolution kernel is shifted over the input feature map, i.e., the convolution kernel slides along the input feature map. In the example of the figure, the convolution stride (Sx, Sy) is (1, 1), so that when the convolution operation is performed after shifting one cell to the right in the transverse direction (i.e., the width direction) or one cell down in the longitudinal direction (i.e., the height direction), the value at position (0, 1) or (1, 0) of the lower output feature map can be obtained, respectively.
From the above description, a convolutional layer of the neural network model has a set of input feature maps containing H x W x C pieces of information, where H and W are the height and width of the input feature maps, respectively, and C is the number of input feature maps, also called the number of input channels. The convolutional layer has convolution kernels of size C x Co x Kh x Kw, where C is the number of input channels, Co is the number of output feature maps (or the number of output channels), and Kh and Kw are the height and width of the convolution kernel, respectively. The output feature maps contain Ho x Wo x Co pieces of information, where Ho and Wo are the height and width of an output feature map, respectively, and Co is the number of output channels. In addition, the convolution operation also involves a convolution stride (Sx, Sy), whose size affects the size of the output feature map.
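For reference, the output size of the padding-free convolution described above follows directly from these parameters. The helper below is a minimal sketch (function and argument names are illustrative, not from the patent):
    def conv_output_size(h, w, kh, kw, sy, sx):
        """Output height/width of a convolution without padding ("valid" mode)."""
        ho = (h - kh) // sy + 1
        wo = (w - kw) // sx + 1
        return ho, wo

    # The example of fig. 7: a 6 x 6 x 3 input, 3 x 3 kernels, stride (1, 1)
    # gives 4 x 4 output feature maps.
    print(conv_output_size(h=6, w=6, kh=3, kw=3, sy=1, sx=1))   # (4, 4)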
Fig. 8 is a schematic block diagram illustrating pooling layer operation in a neural network model according to an embodiment of the present disclosure. For ease of understanding, a feature map of size 4 x 4 is exemplarily shown in the figure, which includes 16 data elements of certain numerical values. The feature map shown here may serve as an example of the output feature map in fig. 7. In one embodiment, average pooling (Average Pooling) may be performed on the output feature map at the pooling layer; for example, a pooling operation applied to the feature map using a pooling kernel with a certain size and sliding stride may yield the 2 x 2 output shown in the middle of fig. 8. Similarly, in one embodiment, sum pooling (Sum Pooling) may be performed at the pooling layer on the output feature map using a pooling kernel with a certain size and sliding stride, yielding the output shown on the right side of fig. 8.
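The following numpy sketch mirrors the pooling just described; the 4 x 4 values are placeholders rather than the actual numbers of fig. 8:
    import numpy as np

    def pool2d(x, kh, kw, sy, sx, mode="avg"):
        """Average or sum pooling over a 2-D feature map, without padding."""
        h, w = x.shape
        ho, wo = (h - kh) // sy + 1, (w - kw) // sx + 1
        out = np.empty((ho, wo), dtype=x.dtype)
        for i in range(ho):
            for j in range(wo):
                window = x[i * sy:i * sy + kh, j * sx:j * sx + kw]
                out[i, j] = window.mean() if mode == "avg" else window.sum()
        return out

    fmap = np.arange(16, dtype=np.float32).reshape(4, 4)   # stand-in for the 4 x 4 map
    print(pool2d(fmap, 2, 2, 2, 2, "avg"))   # 2 x 2 average-pooled result
    print(pool2d(fmap, 2, 2, 2, 2, "sum"))   # 2 x 2 sum-pooled result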
Fig. 9 is a flowchart illustrating a compilation method 900 for optimizing a neural network model, according to an embodiment of the present disclosure. As previously described, the neural network model herein may include a convolutional layer and a pooling layer connected to each other as shown in fig. 6. In one embodiment, the method 900 may be performed by a general purpose processor.
As shown in fig. 9, at step S902, the convolution parameters and weights of the convolutional layer and the pooling parameters of the pooling layer are acquired. In one implementation scenario, the convolution kernel of the convolutional layer and the pooling kernel of the pooling layer may each have one or more dimensions. On this basis, the aforementioned convolution parameters may include the size parameters of the convolution kernel (represented below by conv_kx and conv_ky) and its step size parameters (represented below by conv_sx and conv_sy). Preferably, the size parameter of the convolution kernel may be set larger than its step size parameter. Likewise, the pooling parameters may include the size parameters of the pooling kernel (represented below by pool_kx and pool_ky) and its step size parameters (represented below by pool_sx and pool_sy).
Next, at step S904, the convolution parameters and the pooling parameters are fused to obtain fusion parameters. Similar in nature to the convolution parameters and pooling parameters, the fusion parameters may include the size parameters of the fused convolution kernel (represented below by new_kx and new_ky) and its step size parameters (represented below by new_sx and new_sy). In one embodiment, the size parameter and step size parameter of the convolution kernel in each dimension direction may be fused with the size parameter and step size parameter of the pooling kernel in the corresponding dimension direction, so as to obtain the size parameter and step size parameter of the fused convolution kernel in that direction. As a specific example, when the aforementioned dimension direction is the transverse direction, the transverse size parameter and step size parameter of the convolution kernel may be fused with the transverse size parameter and step size parameter of the pooling kernel, so as to obtain the corresponding size parameter and step size parameter of the fused convolution kernel. Similarly, when the aforementioned dimension direction is the longitudinal direction, the longitudinal size parameter and step size parameter of the convolution kernel may be fused with the longitudinal size parameter and step size parameter of the pooling kernel, so as to obtain the corresponding size parameter and step size parameter of the fused convolution kernel.
Next, at step S906, the neural network model is optimized according to the fusion parameters and pooling parameters described above, so as to convert the convolutional layer and pooling layer into a fused convolutional layer. In other words, by optimizing with these parameters, the scheme of the present disclosure converts the two interconnected layers, the convolutional layer and the pooling layer, into a new convolutional layer (namely the fused convolutional layer), thereby markedly improving the computational efficiency of the artificial intelligence computing system and avoiding additional computation. In addition, since the convolutional layer and the pooling layer are fused into one single layer, the storage process is also simplified.
In one embodiment, to implement the fused convolutional layer in place of the convolutional layer and the pooling layer, the present disclosure proposes to convert the weights of the convolutional layer with the fusion parameters and the pooling parameters, resulting in the fused weights of the fused convolutional layer. Depending on the application scenario, the pooling layer to which the present disclosure relates may perform average pooling and/or sum pooling as shown in fig. 8. In one embodiment, when an average pooling layer is applied, to obtain the fused weights of the fused convolutional layer, the scheme of the present disclosure proposes to stack and sum the weights of the convolutional layer according to the fusion parameters, after which a pooling operation may be performed on the stacked and summed weights according to the pooling parameters to obtain the fused weights of the fused convolutional layer. In one embodiment, the pooling parameters of the average pooling layer may further include pooling weights. In this case, stacking and summing the weights of the convolutional layer according to the fusion parameters as described above may further include stacking and weighted-summing the weights of the convolutional layer according to the fusion parameters and the pooling weights. The operation here is described in detail later with reference to fig. 10.
Finally, at step S908, the optimized neural network model is compiled into a corresponding binary instruction sequence for distribution to the artificial intelligence processor to execute the corresponding task. Depending on the needs of the application, the aspects of the present disclosure may be applied to the training phase of a neural network, including forward propagation and/or backward propagation. For example, when applied in the forward propagation of the neural network model, the convolution operation may be performed based on the fusion parameters and the fused weights. Specifically, after the input neuron data of the neural network model (such as the aforementioned input feature map) is acquired, a convolution operation may be performed on the input neuron data according to the fusion parameters and the fused weights, thereby obtaining the output neuron data. Thanks to the fusion operations of the present disclosure, the output neuron data already has the average-pooled or sum-pooled form shown in fig. 8. For another example, when applied in the backward propagation of the neural network model, the gradient error of the next layer in the neural network model may be acquired first. Then, a convolution operation can be performed on the gradient error of the next layer based on the fusion parameters and the fused weights, so as to obtain the weight gradient of the current layer and the gradient error of the previous layer in the neural network model.
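As a minimal illustration of the fused forward pass described above, the numpy sketch below runs a single convolution with fused strides; the weight values are random placeholders, whereas the real fused weights would come from the weight conversion of fig. 10 discussed next:
    import numpy as np

    def conv2d(x, w, sy, sx):
        """Naive multi-channel convolution: x is (C, H, W), w is (Co, C, Kh, Kw)."""
        c, h, width = x.shape
        co, _, kh, kw = w.shape
        ho, wo = (h - kh) // sy + 1, (width - kw) // sx + 1
        out = np.zeros((co, ho, wo), dtype=x.dtype)
        for o in range(co):
            for i in range(ho):
                for j in range(wo):
                    patch = x[:, i * sy:i * sy + kh, j * sx:j * sx + kw]
                    out[o, i, j] = np.sum(patch * w[o])
        return out

    # Forward pass of the fused layer: one convolution with the fused 4 x 4 kernels
    # and the fused strides (new_sy, new_sx) = (2, 2) replaces Conv + AvgPooling.
    x = np.random.rand(3, 6, 6).astype(np.float32)            # input feature map, C = 3
    fused_w = np.random.rand(2, 3, 4, 4).astype(np.float32)   # placeholder fused kernels, Co = 2
    y = conv2d(x, fused_w, sy=2, sx=2)                        # output already in pooled form
    print(y.shape)                                            # (2, 2, 2)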
Fig. 10 is a schematic diagram illustrating a weight conversion operation according to an embodiment of the present disclosure. As previously described, the weight conversion operation of the present disclosure is used to convert the weights of the convolutional layer into the fused weights of the fused convolutional layer. Specifically, the present disclosure converts the weights of the convolutional layer using the fusion parameters and the pooling parameters, thereby obtaining the fused weights of the fused convolutional layer. The operation by which the present disclosure obtains the fusion parameters is described first below.
As previously described, the present disclosure fuses the convolution parameters and the pooling parameters to obtain fusion parameters. Specifically, the convolution (Conv) parameters include the size parameters (conv_kx, conv_ky) and the sliding step size (simply "step") parameters (conv_sx, conv_sy) of the convolution Kernel (Kernel). Further, the pooling (AvgPooling) parameters include a size parameter (pool_kx, pool_ky) and a step size parameter (pool_sx, pool_sy) of the pooling core. Further, the fusion parameters may include a size parameter (new_kx, new_ky) and a step size parameter (new_sx, new_sy) of the fusion convolution kernel.
Since the convolution kernel, the pooling kernel, and the respective step sizes may involve one or more dimensions, the fusion operations of the present disclosure fuse the size parameters and the step size parameters dimension by dimension. Thus, for ease of illustration only, the aforementioned transverse direction (corresponding to the "Kw" dimension shown in fig. 7) is denoted by "x" and the aforementioned longitudinal direction (corresponding to the "Kh" dimension shown in fig. 7) is denoted by "y", and the following exemplary fusion operations involve fusion in both dimensions. Further, "s" above refers to "step size", "k" refers to "kernel", "conv" refers to "convolution", and "pool" refers to "pooling".
Based on the parameter settings described above, the fusion parameters may be exemplarily determined by the following equations:
new_kx=(pool_kx-1)×conv_sx+conv_kx
new_ky=(pool_ky-1)×conv_sy+conv_ky
new_sx=pool_sx×conv_sx
new_sy=pool_sy×conv_sy
the above-described manner of determining the fusion parameters is merely exemplary, and one skilled in the art may determine the fusion parameters in other suitable manners in light of the teachings of the present disclosure. Further, only some of the above-described fusion parameters may be updated while other parameters remain unchanged. In other words, the fusion operation may be performed on parameters in one or more dimensions to obtain fused parameters for that dimension, while the fusion operation may not be performed on parameters in other dimensions. For example, in one implementation scenario, only the kx dimension may be fused to obtain the fused parameter new_kx, while the fusion operation may not be performed for the remaining parameters, thus keeping the parameters unchanged. By supporting fusion operations on different dimensions, the scheme of the disclosure makes the fusion operations more flexible to adapt to different task scenarios.
After obtaining the fusion parameters through the above-described exemplary operations, the present disclosure may then perform a weight conversion operation as illustrated in fig. 10, i.e., obtain a new convolution kernel. When the above fusion parameters are used to transform the weights of the convolutional layers, the weight transformation can be regarded as stacking the original weights.
As shown in the upper part of fig. 10, two original weights (the same weight, shown as blocks 1 and 2) are stacked along the x-direction (i.e., the transverse direction), assuming pool_kx = 2, i.e., the size of the pooling kernel in the x direction is 2. Specifically, the two original weights are aligned in the y-direction (i.e., the longitudinal direction), and one of them is offset in the x-direction by conv_sx relative to the other, resulting in an overlap region of size (conv_kx - conv_sx) in the x-direction. Then, the values of the two weights in the overlap region are added, while the values elsewhere remain unchanged, at which point a weight of size (conv_sx + conv_kx) × conv_ky is obtained. In this example, the size parameter of the fused convolution kernel among the fusion parameters of the present disclosure is new_kx = (pool_kx - 1) × conv_sx + conv_kx = (2 - 1) × conv_sx + conv_kx = conv_sx + conv_kx.
Based on the above principle, pool_kx weights (two weights when pool_kx = 2, as in the above example) may be stacked in the x-direction, so as to obtain a weight of size ((pool_kx - 1) × conv_sx + conv_kx) × conv_ky. After summing the weights in the overlapping parts, the new weight in the x-axis direction is obtained. Similarly, two original weights may also be stacked and summed along the y-direction (i.e., the longitudinal direction), thereby achieving weight conversion in the y-axis direction, as shown in the lower part of fig. 10. On this basis, pool_ky weights (two weights when pool_ky = 2, as in the example in the figure) can be stacked in the y-direction, thereby obtaining a weight of size ((pool_ky - 1) × conv_sy + conv_ky) × conv_kx. By converting the weights in this manner, multiplications are merged and additions are reduced, so that the error caused in the prior art when small values are accumulated into much larger values can be reduced to a certain extent, and the network precision is improved.
After the weight conversion operation, a new weight of size new_kx × new_ky is finally obtained. An average pooling operation, i.e., division by the pooling kernel size of AvgPooling, may then be performed on the new weight. Taking the pooling operation shown in fig. 8 as an example, and assuming that the feature map on the right side of fig. 8 is the new weight obtained after conversion, the pooling kernel with size and stride 2 × 2 shown in the figure may be slid over the new weight and used for averaging, so as to obtain a fused convolution kernel whose weights have the size parameters new_kx and new_ky and the step size parameters new_sx and new_sy. It is understood that the pooling parameters of the average pooling layer here may also include pooling weights. In that case, when sliding and averaging, the averaging may be performed after multiplying the pooling weights by the weights of the convolution kernel, thereby further improving the operation precision of the network model.
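The stacking-and-averaging procedure of fig. 10 can be sketched as follows, under the assumption of uniform average pooling (all pooling weights equal); the function name and array layout are illustrative rather than the patent's implementation:
    import numpy as np

    def build_fused_weight(w, conv_s, pool_k):
        """Convert convolution weights (Co, C, Kh, Kw) into fused-convolution weights.

        Each pooling position contributes one copy of the original kernel, shifted by
        the convolution stride; overlapping positions are summed, then everything is
        divided by the pooling window size (the average-pooling step).
        """
        co, c, kh, kw = w.shape
        sy, sx = conv_s
        pky, pkx = pool_k
        new_kh = (pky - 1) * sy + kh
        new_kw = (pkx - 1) * sx + kw
        fused = np.zeros((co, c, new_kh, new_kw), dtype=w.dtype)
        for py in range(pky):                 # stack along y
            for px in range(pkx):             # stack along x
                fused[:, :, py * sy:py * sy + kh, px * sx:px * sx + kw] += w
        return fused / (pky * pkx)            # divide by the pooling kernel size

    # A 3 x 3 kernel, stride (1, 1), 2 x 2 average pooling -> one 4 x 4 fused kernel.
    w = np.ones((1, 1, 3, 3), dtype=np.float32)
    print(build_fused_weight(w, conv_s=(1, 1), pool_k=(2, 2))[0, 0])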
Fig. 11 is a schematic block diagram illustrating a compiler 1100 according to an embodiment of the present disclosure. As shown in fig. 11, the compiler of the present disclosure takes a modular design and thus may include an acquisition module 1102, a fusion module 1104, an optimization module 1106, and an allocation module 1108. According to the solution of the present disclosure, these modules may perform the steps corresponding to the method 900 shown in fig. 9, so the descriptions of the steps in the method 900 also apply to the operations of the corresponding modules here and are not repeated. In a heterogeneous system consisting of a general-purpose processor and an artificial intelligence processor, the compiler of the present disclosure may be implemented on the general-purpose processor to obtain the binary code instructions of the optimized neural network model. Thereafter, the binary code instructions may be transferred via the driver interface to the artificial intelligence processor (located, for example, in the chip 101 shown in FIG. 1 or in the computing device 201 of FIG. 2) for execution of the corresponding tasks, such as operational tasks involving convolution and pooling.
FIG. 12 is an operational block diagram illustrating an artificial intelligence computing system 1200 according to an embodiment of the disclosure. As shown in fig. 12, with aspects of the present disclosure, the artificial intelligence computing system 1200 may begin operation at 1201 and receive various types of input at an input module 1202. Next, the fusion of the parameters is performed by the fusion module 1206, and the weight conversion is performed by the weight conversion module 1207. Thereafter, the fusion parameters and the converted weights obtained after fusion may be passed to the operation module 1205, which performs the corresponding calculation task. Finally, the operation ends at 1208. The respective modules are described by way of example below.
In operation, the input module 1202 may include a weight input 1202-1 for receiving weights, including convolution kernel data for convolution operations. Further, as shown, the parameter input 1202-2 may receive various parameters, such as the Conv and AvgPooling parameters described above, including the Conv kernel size parameters conv_kx, conv_ky and step size parameters conv_sx, conv_sy, and the AvgPooling pooling kernel size parameters pool_kx, pool_ky and step size parameters pool_sx, pool_sy. In addition, the neuron input 1202-3 may receive various neuron data, such as the input feature map data of the convolutional layer.
After obtaining the parameters, the fusion module 1206 may calculate the fusion parameters of Conv and AvgPooling, mainly including the fused convolution kernel sizes new_kx, new_ky and strides new_sx, new_sy. The specific fusion operation of the fusion module has been described above and is not repeated here. After the fusion parameters are obtained, the weight conversion module 1207 may convert the weights of the convolution kernels based on the fusion parameters, e.g., to obtain the new weights using the stacking approach described above in connection with fig. 10.
Further, the operation module 1205 may perform the corresponding calculation task according to the fusion parameters obtained by the fusion module 1206 and the converted weights obtained by the weight conversion module 1207. Since the fusion has already been performed, the operation module 1205 can obtain the same result as the existing Conv+AvgPooling structure by performing only a convolution operation with convolution kernels of sizes new_kx, new_ky and strides new_sx, new_sy, thereby significantly reducing the amount of computation, simplifying the computation process and data storage, and improving the overall performance of the computing system.
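As a quick numeric check of the equivalence claimed above, the sketch below reuses the conv2d, pool2d, and build_fused_weight helpers from the earlier sketches; the random data is for illustration only:
    import numpy as np

    np.random.seed(0)
    x = np.random.rand(3, 8, 8)        # input feature map, C = 3
    w = np.random.rand(2, 3, 3, 3)     # original 3 x 3 kernels, Co = 2
    conv_s, pool_k, pool_s = (1, 1), (2, 2), (2, 2)

    # Existing structure: convolution followed by average pooling, channel by channel.
    conv_out = conv2d(x, w, sy=conv_s[0], sx=conv_s[1])
    ref = np.stack([pool2d(ch, *pool_k, *pool_s) for ch in conv_out])

    # Fused structure: a single convolution with the fused weights and fused stride.
    fused_w = build_fused_weight(w, conv_s=conv_s, pool_k=pool_k)
    out = conv2d(x, fused_w, sy=pool_s[0] * conv_s[0], sx=pool_s[1] * conv_s[1])

    print(np.allclose(ref, out))   # True, up to floating-point rounding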
Aspects of the present disclosure are described in detail above with reference to the accompanying drawings. According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an Internet of Things terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicles include aircraft, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; the medical devices include nuclear magnetic resonance apparatuses, B-mode ultrasound apparatuses, and/or electrocardiographs. The electronic device or apparatus of the present disclosure may also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like.
Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device, thereby achieving unified management, scheduling, and collaborative work of the terminal-cloud or edge-cloud combination.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present disclosure is not limited by the order of the described actions. Accordingly, one of ordinary skill in the art will appreciate, in light of the disclosure or teachings herein, that certain steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered optional embodiments, i.e., the actions or modules involved are not necessarily required for the implementation of one or more aspects of this disclosure. In addition, depending on the solution, the descriptions of different embodiments of the present disclosure each have their own emphasis. In view of this, those skilled in the art will appreciate that, for portions not described in detail in one embodiment of the disclosure, reference may be made to the related descriptions of other embodiments.
In particular implementations, based on the description and teachings of the present disclosure, those skilled in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not disclosed herein. For example, with respect to the foregoing embodiments of the electronic device or apparatus, the units are divided herein according to their logical functions, and other manners of division may be used in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As regards the connection relationships between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solution described in the embodiments of the disclosure. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit or each unit may physically reside separately.
In some implementation scenarios, the above-described integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a standalone product, the integrated unit may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of a method described in an embodiment of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory ("ROM"), a random access memory ("RAM"), a removable hard disk, or an optical disc, and other media that can store program code.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuits may include, but is not limited to, physical devices, which in turn may include, but are not limited to, devices such as transistors or memristors. In view of this, the various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a resistive random access memory ("RRAM"), a dynamic random access memory ("DRAM"), a static random access memory ("SRAM"), an enhanced dynamic random access memory ("EDRAM"), a high bandwidth memory ("HBM"), a hybrid memory cube ("HMC"), a ROM, a RAM, or the like.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. The appended claims are intended to define the scope of the disclosure and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (10)

1. A compilation method for optimizing a neural network model, wherein the neural network model comprises a convolutional layer and a pooling layer that are connected to each other, the compilation method being performed by a general purpose processor and comprising:
acquiring convolution parameters and weights of the convolutional layer and pooling parameters of the pooling layer;
fusing the convolution parameters and the pooling parameters to obtain fusion parameters;
optimizing the neural network model according to the fusion parameters and the pooling parameters to convert the convolution layers and the pooling layers into fusion convolution layers, wherein the fusion weights of the fusion convolution layers are obtained by converting the weights of the convolution layers by using the fusion parameters and the pooling parameters; and
Compiling the optimized neural network model into a corresponding binary instruction sequence to be distributed to an artificial intelligent processor for executing a corresponding task; wherein the method comprises the steps of
The artificial intelligence processor is used for processing computer vision, voice and natural language;
wherein the convolution kernel of the convolutional layer and the pooling kernel of the pooling layer each have one or more dimensions, wherein the convolution parameters comprise a size parameter and a step size parameter of the convolution kernel, the pooling parameters comprise a size parameter and a step size parameter of the pooling kernel, and the fusion parameters comprise a size parameter and a step size parameter of a fusion convolution kernel, and wherein fusing the convolution parameters and the pooling parameters to obtain the fusion parameters comprises:
fusing, in each dimension direction, the size parameter and the step size parameter of the convolution kernel with the size parameter and the step size parameter of the pooling kernel in the corresponding dimension direction, so as to obtain the size parameter and the step size parameter of the fusion convolution kernel, respectively.
2. The compiling method of claim 1, wherein the dimension direction is a lateral or longitudinal direction, and wherein fusing the convolution parameters and the pooling parameters to obtain the fusion parameters comprises:
fusing the lateral size parameter and step size parameter of the convolution kernel with the lateral size parameter and step size parameter of the pooling kernel to obtain the size parameter and step size parameter of the fusion convolution kernel, respectively; and/or
fusing the longitudinal size parameter and step size parameter of the convolution kernel with the longitudinal size parameter and step size parameter of the pooling kernel to obtain the size parameter and step size parameter of the fusion convolution kernel, respectively.
3. The compilation method of any of claims 1-2, wherein the pooling layer comprises an average pooling layer or a sum pooling layer.
4. The compilation method of claim 3, wherein the pooling layer is the average pooling layer, and wherein converting weights of the convolutional layer with the fusion parameters and the pooling parameters to obtain the fusion weights comprises:
stacking and summing the weights of the convolution layers according to the fusion parameters; and
performing a pooling operation on the stacked and summed weights according to the pooling parameters to obtain the fusion weights of the fusion convolution layer.
5. The compiling method of claim 1 or 4, wherein the size parameter of the convolution kernel is larger than the step size parameter of the convolution kernel.
6. The compilation method of claim 4, wherein the pooling parameters of the average pooling layer further comprise weights, wherein stacking and summing weights of the convolution layers according to the fusion parameters comprises:
stacking the weights of the convolution layer and performing a weighted summation according to the fusion parameters and the weights.
7. A compiler for optimizing a neural network model, wherein the neural network model includes a convolutional layer and a pooling layer that are connected to each other, the compiler comprising:
the acquisition module is used for acquiring the convolution parameters and weights of the convolution layer and the pooling parameters of the pooling layer;
the fusion module is used for fusing the convolution parameters and the pooling parameters to obtain fusion parameters;
the optimization module is used for optimizing the neural network model according to the fusion parameters and the pooling parameters so as to convert the convolution layer and the pooling layer into a fusion convolution layer, wherein the fusion weights of the fusion convolution layer are obtained by converting the weights of the convolution layer by using the fusion parameters and the pooling parameters; and
the allocation module is used for compiling the optimized neural network model into a corresponding binary instruction sequence to be distributed to the artificial intelligence processor for executing a corresponding task; wherein
the artificial intelligence processor is configured to process computer vision, speech, and natural language;
wherein the convolution kernel of the convolutional layer and the pooling kernel of the pooling layer each have one or more dimensions, wherein the convolution parameters comprise a size parameter and a step size parameter of the convolution kernel, the pooling parameters comprise a size parameter and a step size parameter of the pooling kernel, and the fusion parameters comprise a size parameter and a step size parameter of a fusion convolution kernel, and wherein fusing the convolution parameters and the pooling parameters to obtain the fusion parameters comprises:
fusing, in each dimension direction, the size parameter and the step size parameter of the convolution kernel with the size parameter and the step size parameter of the pooling kernel in the corresponding dimension direction, so as to obtain the size parameter and the step size parameter of the fusion convolution kernel, respectively.
8. An apparatus for optimizing a neural network model, comprising:
at least one processor; and
at least one memory for storing program instructions that, when loaded and executed by the at least one processor, cause the apparatus to perform the method of any of claims 1-6.
9. A computing device comprising an artificial intelligence processor configured to execute a binary instruction sequence compiled by a compilation method according to any of claims 1-6.
10. A board card comprising the computing device of claim 9.
CN202110729713.1A 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof Active CN113469337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729713.1A CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729713.1A CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Publications (2)

Publication Number Publication Date
CN113469337A CN113469337A (en) 2021-10-01
CN113469337B true CN113469337B (en) 2024-04-05

Family

ID=77873963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729713.1A Active CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Country Status (1)

Country Link
CN (1) CN113469337B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114492730A (en) * 2021-12-23 2022-05-13 北京地平线信息技术有限公司 Method and device for compiling neural network model, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109033953A (en) * 2018-06-14 2018-12-18 深圳市博威创盛科技有限公司 Training method, equipment and the storage medium of multi-task learning depth network
CN109409431A (en) * 2018-10-29 2019-03-01 吉林大学 Multisensor attitude data fusion method and system neural network based
CN110191299A (en) * 2019-04-15 2019-08-30 浙江大学 A kind of multiplex frame interpolation method based on convolutional neural networks
CN112330719A (en) * 2020-12-02 2021-02-05 东北大学 Deep learning target tracking method based on feature map segmentation and adaptive fusion
WO2021054990A1 (en) * 2019-09-16 2021-03-25 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6738296B2 (en) * 2017-03-23 2020-08-12 株式会社日立製作所 Processing method by convolutional neural network, learning method of convolutional neural network, and processing device including convolutional neural network
CN110321999B (en) * 2018-03-30 2021-10-01 赛灵思电子科技(北京)有限公司 Neural network computational graph optimization method
CN110889497B (en) * 2018-12-29 2021-04-23 中科寒武纪科技股份有限公司 Learning task compiling method of artificial intelligence processor and related product

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109033953A (en) * 2018-06-14 2018-12-18 深圳市博威创盛科技有限公司 Training method, equipment and the storage medium of multi-task learning depth network
CN109409431A (en) * 2018-10-29 2019-03-01 吉林大学 Multisensor attitude data fusion method and system neural network based
CN110191299A (en) * 2019-04-15 2019-08-30 浙江大学 A kind of multiplex frame interpolation method based on convolutional neural networks
WO2021054990A1 (en) * 2019-09-16 2021-03-25 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
CN112330719A (en) * 2020-12-02 2021-02-05 东北大学 Deep learning target tracking method based on feature map segmentation and adaptive fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A multi-feature clothing image retrieval method based on convolutional neural networks; Wu Fan; Deng Zuojie; Shang Shufei; Journal of Hunan Institute of Engineering (Natural Science Edition) (03); full text *
Wu Fan; Deng Zuojie; Shang Shufei. A multi-feature clothing image retrieval method based on convolutional neural networks. Journal of Hunan Institute of Engineering (Natural Science Edition). 2020, (03), full text. *
Biometric recognition based on two-layer feature fusion; Kong Jun; Journal of Beihua University (Natural Science Edition) (01); full text *

Also Published As

Publication number Publication date
CN113469337A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN111047022B (en) Computing device and related product
CN112633490B (en) Data processing device, method and related product for executing neural network model
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN112799726A (en) Data processing device, method and related product
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113837922A (en) Computing device, data processing method and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN111047021B (en) Computing device and related product
WO2022095675A1 (en) Neural network sparsification apparatus and method and related product
CN116185942A (en) Data processing method, device, storage medium and electronic equipment
CN115840894A (en) Method for processing multidimensional tensor data and related product thereof
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN113867800A (en) Computing device, integrated circuit chip, board card, electronic equipment and computing method
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
CN113792867B (en) Arithmetic circuit, chip and board card
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN114692847B (en) Data processing circuit, data processing method and related products
CN113469328B (en) Device, board, method and readable storage medium for executing revolution passing
CN115599738A (en) Method for optimizing neural network model and related product
CN116090519A (en) Compiling method of convolution operator and related product
CN116385714A (en) Image processing method and related product
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN118210552A (en) Instruction generation method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant