CN113469337A - Compiling method for optimizing neural network model and related product - Google Patents

Compiling method for optimizing neural network model and related product

Info

Publication number
CN113469337A
CN113469337A (application CN202110729713.1A)
Authority
CN
China
Prior art keywords
parameters
pooling
fusion
convolution
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110729713.1A
Other languages
Chinese (zh)
Other versions
CN113469337B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110729713.1A priority Critical patent/CN113469337B/en
Publication of CN113469337A publication Critical patent/CN113469337A/en
Application granted granted Critical
Publication of CN113469337B publication Critical patent/CN113469337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817Specially adapted for signal processing, e.g. Harvard architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Signal Processing (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a compiling method, a compiler, an apparatus, a computing device and a board for optimizing a neural network model, the computing device being included in a combined processing device, the combined processing device further including an interface device and other processing devices. The computing device interacts with other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing device, respectively, for storing data of the computing device and the other processing device. The scheme of the disclosure can significantly improve the computing performance of an intelligent computing system comprising an artificial intelligence processor.

Description

Compiling method for optimizing neural network model and related product
Technical Field
The present disclosure relates generally to the field of artificial intelligence. More particularly, the present disclosure relates to a compiling method, a compiler, an apparatus and a computer program product for optimizing a neural network model, an integrated circuit device including the compiler or the apparatus, and a board including the integrated circuit device.
Background
In recent years, with the reduction of data acquisition difficulty and the great increase of hardware computing power, deep learning has developed rapidly and its algorithms are widely applied across industries. Nevertheless, as the size of the images input to neural networks grows year by year and the number of network parameters also increases, computing power remains the bottleneck that hinders the development and application of algorithms for networks with a huge number of parameters. Therefore, how to improve the utilization of hardware computing power and the operation efficiency of the network has become an optimization focus for many algorithm providers.
In neural networks that include deep learning, the computational load is typically concentrated in convolution ("Conv") operations, and an increase in the input of a convolution operation typically results in an exponential increase in the amount of computation. To reduce the number of parameters of the network, the features of the network are typically further extracted by an average pooling ("AvgPooling") operation. Thus, the Conv + AvgPooling structure often appears in neural networks. However, such a structure suffers from a number of drawbacks. First, since Conv contains a summation process and AvgPooling is also a summation in nature, the Conv + AvgPooling structure performs redundant addition operations, resulting in a waste of computing power. Second, the output results of the Conv calculation need to be stored in an additional location. Since the Conv output is typically several times larger than the AvgPooling output, the existing structure makes inefficient use of available storage resources and also increases I/O bandwidth, thereby compromising output and computational efficiency. Furthermore, since the Conv + AvgPooling structure operates in two steps, one in Conv and the other in AvgPooling, it may suffer from the "large number swallowing small number" problem of floating-point precision, i.e., during accumulation, a value with a smaller absolute value is absorbed by a value with a larger absolute value due to the limited floating-point precision, which introduces a certain error.
Disclosure of Invention
In view of the technical problems mentioned in the background section above, the present disclosure proposes a solution for optimizing a neural network model comprising a "Conv + AvgPooling" structure as described above. With the scheme of the present disclosure, convolutional and pooling layers in a neural network may be fused to obtain a fused convolutional layer in the context of the present disclosure. Thus, the computational complexity of the existing "Conv + AvgPooling" structure can be reduced and the computing performance of an intelligent computing system comprising an artificial intelligence processor can be significantly improved. To this end, the present disclosure provides solutions for optimizing neural network models in a number of aspects as follows.
In a first aspect, the present disclosure provides a compilation method for optimizing a neural network model, wherein the neural network model comprises a convolutional layer and a pooling layer connected to each other, the compilation method being performed by a general-purpose processor and comprising: acquiring convolution parameters and weights of the convolutional layer and pooling parameters of the pooling layer; fusing the convolution parameters and the pooling parameters to obtain fusion parameters; optimizing the neural network model according to the fusion parameters and the pooling parameters to convert the convolutional layer and the pooling layer into a fused convolutional layer, wherein the fusion weights of the fused convolutional layer are obtained by converting the weights of the convolutional layer using the fusion parameters and the pooling parameters; and compiling the optimized neural network model into a corresponding binary instruction sequence to be distributed to an artificial intelligence processor to execute a corresponding task.
In a second aspect, the present disclosure provides a compiler for optimizing a neural network model, wherein the neural network model comprises a convolutional layer and a pooling layer connected to each other, the compiler comprising: an obtaining module for obtaining convolution parameters and weights of the convolutional layer and pooling parameters of the pooling layer; a fusion module for fusing the convolution parameters and the pooling parameters to obtain fusion parameters; an optimization module configured to optimize the neural network model according to the fusion parameters and the pooling parameters, so as to convert the convolutional layer and the pooling layer into a fused convolutional layer, wherein the fusion weights of the fused convolutional layer are obtained by converting the weights of the convolutional layer using the fusion parameters and the pooling parameters; and a distribution module for compiling the optimized neural network model into a corresponding binary instruction sequence to be distributed to the artificial intelligence processor to execute a corresponding task.
In a third aspect, the present disclosure provides an apparatus for optimizing a neural network model, comprising: at least one processor; and at least one memory for storing program instructions that, when loaded and executed by the at least one processor, cause the apparatus to perform the method as set forth in the preceding and following embodiments.
In a fourth aspect, the present disclosure provides a computer program product comprising program instructions which, when executed by a processor, implement the method as described in the preceding and following embodiments.
In a fifth aspect, the present disclosure provides a computing device comprising an artificial intelligence processor configured to execute a sequence of binary instructions compiled according to a compilation method as described above and in various embodiments below.
In a sixth aspect, the present disclosure provides a board comprising a computing device as described above and in various embodiments below.
With the fusion scheme provided in the aspects of the present disclosure, existing convolution and pooling operations can be optimized to the maximum extent. In particular, the disclosed approach may reduce the overall computational effort in performing tasks associated with neural network models through fusion operations. Further, by fusing the convolutional layer and the pooling layer, it is possible to overcome, for example, the I/O overhead caused by the output of the existing convolutional layer and improve the utilization efficiency of hardware. Meanwhile, due to the fusion of the convolutional layer and the pooling layer, errors caused by accumulation in the operation of the convolutional layer and the pooling layer can be reduced, and therefore the precision and the accuracy of the neural network model operation are improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts and in which:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating an internal architecture of a single core computing device, according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an internal architecture of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present disclosure;
FIG. 6 is a simplified block diagram illustrating certain layers of a neural network model to which the disclosed aspects relate;
FIG. 7 is a schematic block diagram illustrating convolutional layer operation in a neural network model in accordance with an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram illustrating the operation of pooling layers in a neural network model in accordance with an embodiment of the present disclosure;
FIG. 9 is a flow diagram illustrating a compilation method for optimizing a neural network model in accordance with an embodiment of the present disclosure;
FIG. 10 is a schematic diagram illustrating a weight conversion operation according to an embodiment of the present disclosure;
FIG. 11 is a schematic block diagram illustrating a compiler in accordance with an embodiment of the present disclosure; and
FIG. 12 is an operational block diagram illustrating an artificial intelligence computing system in accordance with an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," and "third," etc. as may be used in the claims, the description, and the drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. It should be understood that the configuration and composition shown in FIG. 1 is merely an example, and is not intended to limit the aspects of the present disclosure in any way.
As shown in fig. 1, board 10 includes a Chip 101, which may be a System on Chip (SoC) as described in the context of the present disclosure. In one implementation scenario, it may be integrated with one or more combined processing devices. The combined processing device may be an artificial intelligence operation unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing and data mining, with deep learning technology being applied extensively in the field of cloud intelligence. One significant characteristic of cloud-based intelligent applications is the large input data size, which places high demands on the storage capacity and computing power of the platform; the board card 10 of this embodiment is suited to cloud-based intelligent applications, having huge off-chip storage, huge on-chip storage and strong computing power.
As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like, according to different application scenarios. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 may also include a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 may be configured to regulate the state of the chip 101. For this purpose, in an application scenario, the control device 106 may include a Micro Controller Unit (MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of the above-described embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a Dynamic Random Access Memory (DRAM) 204.
The computing device 201 may be configured to perform user-specified operations, primarily implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations, it may be used to perform calculations in terms of deep learning or machine learning, and may also interact with the processing means 203 through the interface means 202 to collectively complete the user-specified operations. In aspects of the present disclosure, the computing device may be configured to perform various tasks of the optimized neural network model, such as performing a fused convolution operation using fused parameters of the convolutional and pooling layers as will be described later in the present disclosure.
The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general purpose processing device, performs basic control including, but not limited to, data transfer, starting and/or stopping of the computing device 201, and the like. Depending on the implementation, the processing device 203 may be one or more types of Central Processing Unit (CPU), Graphics Processing Unit (GPU) or other general purpose and/or special purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure may be viewed as having either a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 are considered to form a heterogeneous multi-core structure. According to aspects of the present disclosure, when implemented as a general-purpose processor, the processing device 203 may perform the compilation operation for optimizing the neural network model in order to compile the neural network model into a sequence of binary instructions executable by the computing device.
The DRAM 204 is used to store data to be processed. It is a Double Data Rate (DDR) memory, typically 16 GB or larger in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing apparatus 201 as a single core. The single-core computing device 301 is used for processing input data such as computer vision, voice, natural language processing, data mining, and the like, and the single-core computing device 301 includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decode unit 312 decodes the obtained instruction and sends the decoded result to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is used for performing the core calculation of the deep learning algorithm, i.e. matrix multiplication and convolution. The storage module 33 is used to store or transport related data, and includes a Neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a Direct Memory Access (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate results after computation; WRAM 332 is used to store the convolution kernel of the deep learning network, i.e. the weight; the DMA 333 is connected to the DRAM 204 through the bus 34 for performing data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing apparatus 201 with multiple cores. The multi-core computing device 41 is designed in a hierarchical structure, with the multi-core computing device 41 being a system on a chip that includes at least one cluster (cluster) according to the present disclosure, each cluster in turn including a plurality of processor cores. In other words, the multi-core computing device 41 is constructed in a system-on-chip-cluster-processor core hierarchy. In a system-on-chip hierarchy, as shown in FIG. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be multiple external memory controllers 401 (two are shown in the figure as an example), which are used to respond to access requests issued by the processor cores and access the external storage device, i.e., the off-chip memory in the context of this disclosure (e.g., DRAM 204 in fig. 2), so as to read data from or write data to off-chip memory. The peripheral communication module 402 is used for receiving control signals from the processing device 203 through the interface device 202 and starting the computing device 201 to execute a task. The on-chip interconnect module 403 connects the external memory controllers 401, the peripheral communication module 402 and the plurality of clusters 405 for transmitting data and control signals between the respective modules. The synchronization module 404 is a Global synchronization Barrier Controller (GBC) for coordinating the operation progress of each cluster and ensuring the synchronization of information. The plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although 4 clusters are exemplarily shown in fig. 4, as hardware evolves, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In one application scenario, the clusters 405 may be used to efficiently execute a deep learning algorithm.
Looking at the cluster hierarchy, as shown in fig. 4, each cluster 405 may include a plurality of processor cores (IPU core) 406 and a memory core (MEM core) 407, which may include, for example, a cache memory (e.g., LLC) as described in the context of the present disclosure.
The processor cores 406 are exemplarily shown as 4 in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3, and as such may include three modules: a control module 51, an operation module 52 and a storage module 53. The functions and structures of the control module 51, the operation module 52 and the storage module 53 are substantially the same as those of the control module 31, the operation module 32 and the storage module 33, and are not described here again. It should be particularly noted that the storage module 53 may include an Input/Output Direct Memory Access (IODMA) module 533 and a transport Direct Memory Access (MVDMA) module 534. The IODMA 533 controls access between NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control access between NRAM 531/WRAM 532 and the shared storage unit (SRAM) 408.
Returning to FIG. 4, the memory core 407 is primarily used for storage and communication, i.e., storing data shared among the processor cores 406 or intermediate results, as well as performing communication between the cluster 405 and the DRAM 204, communication between the clusters 405, communication between the processor cores 406, and the like. In other embodiments, the memory core 407 may have the capability of scalar operations so as to perform scalar operations.
The memory core 407 may include a Static Random-Access Memory (SRAM) 408, a broadcast bus 409, a Cluster Direct Memory Access (CDMA) 410, and a Global Direct Memory Access (GDMA) 411. In one implementation scenario, the SRAM 408 may assume the role of a high-performance data transit station. Thus, data multiplexed between different processor cores 406 within the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed between the processor cores 406 via the SRAM 408. Further, the memory core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that inter-core communication efficiency can be improved and off-chip input/output accesses can be significantly reduced.
Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication among processor cores 406, communication among cluster 405, and data transfer between cluster 405 and DRAM 204, respectively. As will be described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204 or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 531 or WRAM 532 may be achieved in two ways. The first way is to communicate between the NRAM 531 or WRAM 532 and the DRAM 204 directly through the IODMA 533; the second way is to transfer data between the DRAM 204 and the SRAM 408 through the GDMA 411, and to transfer data between the SRAM 408 and the NRAM 531 or WRAM 532 through the MVDMA 534. Although the second approach may require more components and longer data flows, in some embodiments the bandwidth of the second approach is substantially greater than that of the first, so it may be more efficient to carry out communication between the DRAM 204 and the NRAM 531 or WRAM 532 in the second way. It is understood that the data transmission schemes described herein are merely exemplary, and those skilled in the art can flexibly select and adapt various data transmission schemes according to the specific arrangement of hardware in light of the teachings of the present disclosure.
In other embodiments, the functionality of GDMA 411 and the functionality of IODMA 533 may be integrated in the same component. Although the present disclosure considers GDMA 411 and IODMA 533 as different components for convenience of description, it will be within the scope of protection of the present disclosure for a person skilled in the art as long as the achieved functions and technical effects are similar to the present disclosure. Further, the functions of GDMA 411, IODMA 533, CDMA 410, and MVDMA 534 may be implemented by the same component.
The hardware architecture and its internal structure of the present disclosure are described in detail above in conjunction with fig. 1-5. It is to be understood that the above description is intended to be illustrative, and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also change the board card and the internal structure of the present disclosure, and these changes still fall into the protection scope of the present disclosure. The scheme of the present disclosure for a system on chip will be described in detail below.
Fig. 6 is a simplified block diagram illustrating certain layers of a neural network model 600 in accordance with aspects of the present disclosure. As known to those skilled in the art, a neural network model typically includes an input layer, an output layer, and hidden layers located between the input and output layers. In accordance with aspects of the present disclosure, the aforementioned hidden layers may include the interconnected convolutional layer 602 and pooling layer 603 shown in fig. 6. In one implementation scenario, the convolutional layer may receive an output from a previous layer, i.e., the input 601 shown in fig. 6, which may be referred to, for example, as an input feature map. The convolutional layer may obtain an output feature map by performing a convolution operation (comprising multiply-add operations) on the input feature map with a specific filter (or convolution kernel), and input it to the pooling layer. Depending on the application scenario, the pooling layer may perform pooling operations on the input data, such as the common average pooling or sum pooling, thereby reducing the size of the output 604 by reducing the size of its input (e.g., the aforementioned output feature map).
Figure 7 is a schematic block diagram illustrating convolutional layer operation in a neural network model in accordance with an embodiment of the present disclosure. As shown in the figure, the convolutional layer of the neural network model may convolve the input feature map with a convolution kernel so as to perform feature extraction and obtain an output feature map.
An input feature map of size 6 × 6 × 3 (i.e., input neuron data in the context of the present disclosure) is exemplarily shown, which may represent 3 feature maps of size 6 × 6 (i.e., a three-dimensional tensor of size 6 × 6 × 3), representing three different features respectively. The input feature map in this example has a width W of 6 and a height H of 6. The number of input feature maps may also be referred to as the number of input channels Ci. For example, the example input in the figure has 3 feature maps, also called 3 feature channels.
Also illustrated in fig. 7 are convolution kernels (or filters) of size 2 × 3 × 3 × 3, which may represent 2 stereo convolution kernels of size 3 × 3 × 3 (i.e., 2 three-dimensional tensors of size 3 × 3 × 3), each stereo convolution kernel having 3 different two-dimensional convolution kernels of size 3 × 3, corresponding to the 3 different two-dimensional feature maps of the input feature map. The number of stereo convolution kernels may be referred to as the number of output channels Co, which is 2 in this example. In each stereo convolution kernel, the number of two-dimensional convolution kernels may be referred to as the number of input channels Ci, which is consistent with the channel number of the input feature map. Each two-dimensional convolution kernel has a corresponding width Kw and height Kh, which are both 3 in this example.
As further shown in the figure, the convolution result of the input feature map and the convolution kernels is output as 2 two-dimensional feature maps of size 4 × 4. Here, the convolution of the input feature map with the lower stereo convolution kernel yields the lower 4 × 4 two-dimensional output feature map. The value at each position in the two-dimensional output feature map is obtained by performing a convolution operation between the corresponding block of each input feature map and the corresponding two-dimensional convolution kernel and then summing the results. For example, the figure shows that the value at position (0,0) of the lower output feature map is obtained by convolving the block framed by the black cube in the input feature map with the lower stereo convolution kernel to obtain 3 values, and then adding them to obtain the final value. In order to obtain the output at other positions, the convolution kernel can be moved over the input feature map, i.e., the convolution kernel slides along the input feature map. In the example of the figure, the convolution stride (Sx, Sy) is (1,1), so that when the kernel moves by one cell horizontally (i.e., in the width direction) or vertically (i.e., in the height direction) and the convolution operation is performed again, the value at position (0,1) or (1,0) of the lower output feature map is obtained.
As can be seen from the above description, in a convolutional layer of a neural network model, there is a group of input feature maps containing H × W × C pieces of information, where H and W are the height and width of the input feature maps, respectively, and C is the number of input feature maps, also called the number of input channels. The convolutional layer has convolution kernels of size Co × C × Kh × Kw, where C is the number of input channels, Co is the number of output feature maps (or output channels), and Kh and Kw are the height and width of the convolution kernels, respectively. The output feature map contains Ho × Wo × Co pieces of information, where Ho and Wo are the height and width of the output feature map, respectively, and Co is the number of output channels. In addition, convolution strides (Sx, Sy) are involved in the convolution operation, and the size of the stride affects the size of the output feature map.
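To make the shape bookkeeping above concrete, the following minimal NumPy sketch reproduces the direct convolution of the Fig. 7 example (a 6 × 6 × 3 input, 2 stereo convolution kernels of size 3 × 3 × 3, stride (1,1), giving a 2 × 4 × 4 output). The function name conv2d, the channel-first data layout and the random test data are illustrative assumptions of this description, not part of the patent.

```python
import numpy as np

def conv2d(x, w, stride=(1, 1)):
    """Direct (valid) convolution: x is (Ci, H, W), w is (Co, Ci, Kh, Kw)."""
    ci, h, wd = x.shape
    co, _, kh, kw = w.shape
    sy, sx = stride
    ho = (h - kh) // sy + 1          # Ho = (H - Kh) / Sy + 1
    wo = (wd - kw) // sx + 1         # Wo = (W - Kw) / Sx + 1
    y = np.zeros((co, ho, wo))
    for c in range(co):
        for i in range(ho):
            for j in range(wo):
                block = x[:, i * sy:i * sy + kh, j * sx:j * sx + kw]
                y[c, i, j] = np.sum(block * w[c])   # multiply-add over Ci x Kh x Kw
    return y

x = np.random.rand(3, 6, 6)      # 3 input feature maps of size 6 x 6
w = np.random.rand(2, 3, 3, 3)   # 2 stereo kernels, each with 3 two-dimensional 3 x 3 kernels
print(conv2d(x, w).shape)        # (2, 4, 4)
```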
FIG. 8 is a schematic block diagram illustrating the operation of pooling layers in a neural network model in accordance with an embodiment of the present disclosure. For ease of understanding, a feature map of size 4 × 4 is exemplarily shown in the figure, which includes 16 data elements each having a certain numerical value. The feature map shown here may serve as an example of the output feature map in fig. 7. In one embodiment, Average Pooling may be performed on the output feature map at the pooling layer; for example, pooling the 4 × 4 map with a 2 × 2 pooling kernel and a sliding stride of 2 yields the 2 × 2 output result shown in the middle of FIG. 8. Similarly, in one embodiment, performing Sum Pooling on the output feature map at the pooling layer with a pooling kernel of the same size and sliding stride yields the output result shown on the right side of FIG. 8.
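Under the same illustrative assumptions (NumPy, hypothetical function name pool2d), the short sketch below reproduces the average pooling and sum pooling of Fig. 8 on a 4 × 4 feature map with a 2 × 2 pooling kernel and a stride of 2.

```python
import numpy as np

def pool2d(x, k=(2, 2), stride=(2, 2), mode="avg"):
    """Average or sum pooling over a single two-dimensional feature map."""
    kh, kw = k
    sy, sx = stride
    ho = (x.shape[0] - kh) // sy + 1
    wo = (x.shape[1] - kw) // sx + 1
    y = np.zeros((ho, wo))
    for i in range(ho):
        for j in range(wo):
            win = x[i * sy:i * sy + kh, j * sx:j * sx + kw]
            y[i, j] = win.mean() if mode == "avg" else win.sum()
    return y

fm = np.arange(16, dtype=float).reshape(4, 4)   # a 4 x 4 feature map as in Fig. 8
print(pool2d(fm, mode="avg"))                   # 2 x 2 average-pooled result
print(pool2d(fm, mode="sum"))                   # 2 x 2 sum-pooled result
```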
FIG. 9 is a flow diagram illustrating a compilation method 900 for optimizing a neural network model according to an embodiment of the present disclosure. As previously mentioned, the neural network model herein may include convolutional and pooling layers interconnected as shown in FIG. 6. In one embodiment, the method 900 may be performed by a general purpose processor.
As shown in fig. 9, at step S902, the convolution parameters and weights of the convolutional layer and the pooling parameters of the pooling layer are acquired. In one implementation scenario, the convolution kernel of the convolutional layer and the pooling kernel of the pooling layer may each involve one or more dimensions. Based on this, the foregoing convolution parameters may include a size parameter (denoted below by conv_kx and conv_ky) and a step size parameter (denoted below by conv_sx and conv_sy) of the convolution kernel. Preferably, the size parameter of the convolution kernel may be set to be larger than its step size parameter. Likewise, the pooling parameters may include a size parameter (denoted below by pool_kx and pool_ky) and a step size parameter (denoted below by pool_sx and pool_sy) of the pooling kernel.
Next, at step S904, the above-mentioned convolution parameters and pooling parameters are fused to obtain fusion parameters. Similar to the convolution and pooling parameters, the fusion parameters herein may include a size parameter (denoted below by new_kx and new_ky) and a step size parameter (denoted below by new_sx and new_sy) of the fused convolution kernel. In one embodiment, taking the dimension direction as a reference, the size parameter and the step size parameter of the convolution kernel in each dimension direction may be fused with the size parameter and the step size parameter of the pooling kernel in the corresponding dimension direction, so as to obtain the size parameter and the step size parameter of the fused convolution kernel, respectively. As a specific example, when the aforementioned dimension direction is the horizontal direction, the horizontal size parameter and step size parameter of the convolution kernel may be fused with the horizontal size parameter and step size parameter of the pooling kernel to obtain the size parameter and step size parameter of the fused convolution kernel, respectively. Similarly, when the aforementioned dimension direction is the vertical direction, the vertical size parameter and step size parameter of the convolution kernel may be fused with the vertical size parameter and step size parameter of the pooling kernel to obtain the size parameter and step size parameter of the fused convolution kernel, respectively.
Next, at step S906, the neural network model is optimized according to the aforementioned fusion parameters and pooling parameters to convert the convolutional and pooling layers into a fused convolutional layer. In other words, by using the parameters for optimization, the scheme of the present disclosure can convert two interconnected layers, namely, the convolutional layer and the pooling layer, into a new convolutional layer (i.e., the aforementioned fused convolutional layer), thereby significantly improving the computational efficiency of the artificial intelligence computing system and avoiding additional computation workload. In addition, this also simplifies the storage process, as the convolutional layer and the pooling layer are fused into a single layer.
In one embodiment, to implement a fused convolutional layer instead of a convolutional layer and a pooling layer, the present disclosure proposes to convert the weights of the convolutional layer using a fusion parameter and a pooling parameter, thereby obtaining a fusion weight of the fused convolutional layer. Depending on the application scenario, the pooling layer according to aspects of the present disclosure may perform an average pooling and/or a sum pooling as illustrated in fig. 8. In one embodiment, when applying the average pooling layer, in order to obtain the fusion weights for the fused convolutional layers, the scheme of the present disclosure proposes to stack and sum the weights of the convolutional layers according to the fusion parameters, and then the stacked and summed weights may be pooled according to the pooling parameters to obtain the fusion weights for the fused convolutional layers. In one embodiment, the foregoing pooling parameters of the average pooling layer may further include a weight. In this case, the aforementioned stacking and summing the weights of the convolutional layers according to the fusion parameters may further include stacking and weighted summing the weights of the convolutional layers according to the fusion parameters and the weights. With regard to the operation herein, detailed description will be made later in conjunction with fig. 10.
Finally, at step S908, the optimized neural network model is compiled into a corresponding binary instruction sequence for distribution to the artificial intelligence processor for execution of a corresponding task. The scheme of the present disclosure can be applied to the training phase of the neural network, including forward propagation and/or backward propagation, according to different application requirements. For example, when applied to the forward propagation of a neural network model, a convolution operation may be performed based on the fusion parameters and the fusion weights. Specifically, after acquiring input neuron data (e.g., the aforementioned input feature map) of the neural network model, a convolution operation may be performed on the input neuron data according to the fusion parameters and the fusion weights, thereby obtaining output neuron data. As a result of the fusion operation of the present disclosure, the output neuron data may have the form of average-pooled or sum-pooled data as in fig. 8. For another example, when applied to the backward propagation of a neural network model, the gradient error of the next layer in the neural network model may be obtained first. Then, a convolution operation may be performed on the gradient error of that next layer based on the fusion parameters and the fusion weights, so as to obtain the weight gradient of the current layer and the gradient error of the previous layer in the neural network model.
Fig. 10 is a schematic diagram illustrating a weight conversion operation according to an embodiment of the present disclosure. As previously described, the weight conversion operation of the present disclosure is used to convert the weights of the convolutional layers into fused weights for the fused convolutional layers. In particular, the present disclosure transforms weights of the convolutional layers using the fusion parameters and pooling parameters, resulting in fusion weights for the fused convolutional layers. The operation of the present disclosure to obtain fusion parameters is first described below.
As previously described, the present disclosure fuses the convolution parameters and pooling parameters to obtain fusion parameters. Specifically, the convolution (Conv) parameters include the size parameters (conv_kx, conv_ky) and sliding step size (abbreviated as "step size") parameters (conv_sx, conv_sy) of the convolution kernel (Kernel). Further, the pooling (AvgPooling) parameters include the size parameters (pool_kx, pool_ky) and step size parameters (pool_sx, pool_sy) of the pooling kernel. Further, the fusion parameters may include the size parameters (new_kx, new_ky) and step size parameters (new_sx, new_sy) of the fused convolution kernel.
The convolution kernel, the pooling kernel, and their respective step sizes may involve one or more dimensions, and the fusion operation of the present disclosure fuses the size parameter and the step size parameter dimension by dimension. Thus, for purposes of illustration only, "x" denotes the aforementioned lateral direction (equivalent to the "Kw" dimension shown in fig. 7) and "y" denotes the aforementioned longitudinal direction (equivalent to the "Kh" dimension shown in fig. 7), and the following exemplary fusion operation involves fusion in both dimensions. Further, "s" indicates a relation to the "step size" and "k" a relation to the "kernel", while "conv" relates to the "convolution" and "pool" to the "pooling".
Based on the above-described parameter settings, the fusion parameters can be exemplarily determined by the following equations:
new_kx=(pool_kx-1)×conv_sx+conv_kx
new_ky=(pool_ky-1)×conv_sy+conv_ky
new_sx=pool_sx×conv_sx
new_sy=pool_sy×conv_sy
the above-described manner of determining the fusion parameters is merely exemplary, and those skilled in the art may determine the fusion parameters in other suitable manners according to the teachings of the present disclosure. Further, only part of the above fusion parameters may be updated, while the other parameters are kept unchanged. In other words, parameters in one or more dimensions may be fused to obtain fused parameters for that dimension, while parameters in other dimensions may not be fused. For example, in one implementation scenario, the fusion operation may be performed only on the kx dimension to obtain the fusion parameters new _ kx, while no fusion operation may be performed on the remaining parameters, thereby keeping the parameters unchanged. By supporting the fusion operation in different dimensions, the scheme of the disclosure enables the fusion operation to be more flexible to adapt to different task scenarios.
After obtaining the fusion parameters through the above exemplary operations, the present disclosure may then perform a weight conversion operation as illustrated in fig. 10, i.e., obtain a new convolution kernel. When the weights of the convolutional layers are converted using the above-mentioned fusion parameters, the weight conversion can be regarded as stacking the original weights.
As shown in the upper part of fig. 10, two original weights (identical weights labeled 1 and 2 in the figure) are stacked along the x direction (i.e., the lateral direction), assuming pool_kx = 2, i.e., the size of the pooling kernel in the x direction is 2. Specifically, the two original weights are aligned in the y direction (i.e., the vertical direction), with one weight shifted from the other by conv_sx in the x direction, which yields an overlapping region of width (conv_kx - conv_sx) in the x direction. Then, the values of the two weights in the overlapping region are added and the values elsewhere are kept unchanged, so that a weight of size (conv_sx + conv_kx) × conv_ky is obtained. In this example, the size parameter new_kx of the fused convolution kernel is (pool_kx - 1) × conv_sx + conv_kx = (2 - 1) × conv_sx + conv_kx = conv_sx + conv_kx.
Based on the above principle, pool_kx weights (two weights for pool_kx = 2 in the above example) can be stacked in the x direction to obtain a weight of size ((pool_kx - 1) × conv_sx + conv_kx) × conv_ky. After summing the weights in the overlapping parts, a new weight in the x-axis direction is obtained. Similarly, the original weights may also be stacked and summed along the y direction (i.e., the vertical direction) to realize the weight conversion in the y-axis direction, as shown in the lower part of fig. 10. Accordingly, pool_ky weights (two weights for pool_ky = 2 in the example in the figure) are stacked in the y direction to obtain a weight of size ((pool_ky - 1) × conv_sy + conv_ky) × conv_kx. By converting the weights in this manner, multiplications are merged and additions are reduced, so that the error caused by adding numbers of very different magnitudes in the prior art can be reduced to a certain extent and the accuracy of the network improved.
After the above-mentioned weight conversion operation, a new weight of size new_kx × new_ky is finally obtained. Next, an average pooling operation may be performed on the new weight, i.e., dividing it by the pooling kernel size of AvgPooling. Taking the pooling operation shown in fig. 8 as an example, assuming that the feature map on the right side of fig. 8 is the new weight obtained after conversion, the pooling kernel with size and step size of 2 × 2 in the figure can be slid over the new weight and the values averaged, so as to obtain a fused convolution kernel whose size parameters are new_kx and new_ky and whose step size parameters are new_sx and new_sy. It is understood that the pooling parameters of the average pooling layer herein may also include weight values. In that case, during the sliding and averaging, the pooling weight can be multiplied with the weight of the convolution kernel before averaging, thereby further improving the operation precision of the network model.
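A minimal sketch of this weight conversion is given below, assuming a single input/output channel and an average pooling layer without extra pooling weights; the function name fuse_conv_avgpool_weight and the demo values are illustrative assumptions. It stacks pool_ky × pool_kx shifted copies of the original kernel, sums the overlapping regions, and divides by the pooling kernel size, as described above. For a real convolutional layer, the same conversion would be applied to each (Co, Ci) slice of the weight tensor.

```python
import numpy as np

def fuse_conv_avgpool_weight(w, conv_s, pool_k):
    """Convert a 2-D conv kernel w into the fused kernel (single-channel sketch)."""
    kh, kw = w.shape
    sy, sx = conv_s
    pky, pkx = pool_k
    new_kh = (pky - 1) * sy + kh     # new_ky = (pool_ky - 1) * conv_sy + conv_ky
    new_kw = (pkx - 1) * sx + kw     # new_kx = (pool_kx - 1) * conv_sx + conv_kx
    fused = np.zeros((new_kh, new_kw))
    for py in range(pky):            # stack copies shifted by conv_sy along y
        for px in range(pkx):        # stack copies shifted by conv_sx along x
            fused[py * sy:py * sy + kh, px * sx:px * sx + kw] += w
    return fused / (pky * pkx)       # average pooling: divide by the pooling kernel size

w = np.arange(9, dtype=float).reshape(3, 3)           # a 3 x 3 conv kernel
print(fuse_conv_avgpool_weight(w, (1, 1), (2, 2)))    # a 4 x 4 fused kernel
```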
Fig. 11 is a schematic block diagram illustrating a compiler 1100 according to an embodiment of the present disclosure. As shown in fig. 11, the compiler of the present disclosure is of a modular design and, thus, may include an acquisition module 1102, a fusion module 1104, an optimization module 1106, and an assignment module 1108. According to the solution of the present disclosure, a plurality of modules herein may respectively perform steps corresponding to those in the method 900 shown in fig. 9, and therefore the description about the steps in the method 900 also applies to the operation of the corresponding modules herein and is not repeated herein. In a heterogeneous system composed of a general-purpose processor and an artificial intelligence processor, the compiler of the present disclosure may be implemented on the general-purpose processor to obtain binary code instructions of an optimized neural network model. Thereafter, the binary code instructions may be transferred via a driver interface to an artificial intelligence processor (located, for example, in chip 101 shown in fig. 1 or in computing device 201 of fig. 2) for performing corresponding tasks by the artificial intelligence processor, such as operational tasks involving convolution and pooling.
FIG. 12 is a block diagram illustrating the operation of an artificial intelligence computing system 1200 according to an embodiment of the disclosure. As shown in FIG. 12, with aspects of the present disclosure, the artificial intelligence computing system 1200 may begin performing operations at 1201, and various types of input are received at the input module 1202. Next, a fusion operation on the parameters is performed by the fusion module 1206, and weight conversion is performed by the weight conversion module 1207. Thereafter, the fusion parameters and the converted weights obtained after the fusion can be transmitted to the operation module 1205, which executes the corresponding calculation task. Finally, the operation ends at 1208. The modules are described exemplarily below.
In operation, the input module 1202 may include a weight input 1202-1 for receiving weights, including convolution kernel data for convolution operations. Further, the parameter input 1202-2 as shown may receive various parameters, such as the Conv and AvgPooling parameters described previously, including the convolution kernel size parameters conv_kx and conv_ky of Conv, its step size parameters conv_sx and conv_sy, the pooling kernel size parameters pool_kx and pool_ky of AvgPooling, and its step size parameters pool_sx and pool_sy. Additionally, the neuron input 1202-3 may receive various neuron data, including input feature map data such as that of convolutional layers.
After obtaining the parameters, the fusion module 1206 may calculate the fusion parameters of Conv and AvgPooling, which mainly include the fused convolution kernel sizes new_kx and new_ky and the step sizes new_sx and new_sy. The specific fusion operation of the fusion module has been described above and is not repeated here. After obtaining the fusion parameters, the weight conversion module 1207 may convert the weights of the convolution kernels based on the fusion parameters, for example, by using the stacking approach described above in connection with fig. 10 to obtain new weights.
Further, the operation module 1205 may execute the corresponding calculation task according to the fusion parameters obtained by the fusion module 1206 and the converted weights obtained by the weight conversion module 1207. Since the fusion operation has already been performed, the operation module 1205 can obtain the same result as the existing (Conv + AvgPooling) structure by performing only a convolution operation with convolution kernels of size new_kx and new_ky and strides new_sx and new_sy, thereby significantly reducing the amount of calculation, simplifying the calculation flow and data storage, and also improving the overall performance of the computing system.
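To illustrate this equivalence, the following self-contained sketch (single channel, NumPy, hypothetical helper names) compares the existing Conv + AvgPooling pipeline with a single convolution using the fused kernel and the fused stride; under these stated assumptions the two outputs match numerically.

```python
import numpy as np

def conv2d(x, w, s):
    """Valid 2-D convolution with stride s (single channel for brevity)."""
    kh, kw = w.shape
    ho = (x.shape[0] - kh) // s[0] + 1
    wo = (x.shape[1] - kw) // s[1] + 1
    return np.array([[np.sum(x[i*s[0]:i*s[0]+kh, j*s[1]:j*s[1]+kw] * w)
                      for j in range(wo)] for i in range(ho)])

def avg_pool(x, k, s):
    # average pooling expressed as a convolution with a constant kernel
    return conv2d(x, np.full(k, 1.0 / (k[0] * k[1])), s)

def fuse_weight(w, conv_s, pool_k):
    # stack shifted copies of w, sum the overlaps, divide by the pooling kernel size
    kh, kw = w.shape
    fused = np.zeros(((pool_k[0] - 1) * conv_s[0] + kh, (pool_k[1] - 1) * conv_s[1] + kw))
    for py in range(pool_k[0]):
        for px in range(pool_k[1]):
            fused[py*conv_s[0]:py*conv_s[0]+kh, px*conv_s[1]:px*conv_s[1]+kw] += w
    return fused / (pool_k[0] * pool_k[1])

x = np.random.rand(10, 10)                                  # input feature map
w = np.random.rand(3, 3)                                    # original conv kernel
conv_s, pool_k, pool_s = (1, 1), (2, 2), (2, 2)
ref = avg_pool(conv2d(x, w, conv_s), pool_k, pool_s)        # Conv followed by AvgPooling
new_s = (pool_s[0] * conv_s[0], pool_s[1] * conv_s[1])      # fused stride
out = conv2d(x, fuse_weight(w, conv_s, pool_k), new_s)      # single fused convolution
print(np.allclose(ref, out))                                # True
```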
The aspects of the present disclosure are described in detail above with reference to the accompanying drawings. According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like.
Further, the electronic device or apparatus of the present disclosure may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that, for the sake of brevity, the present disclosure presents some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the aspects of the disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that, in accordance with the disclosure or teachings of the present disclosure, certain steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be practiced in ways other than those specifically disclosed, and that the acts or modules illustrated herein are not necessarily required to practice one or more aspects of the disclosure. In addition, depending on the solution, the present disclosure may focus its description on some embodiments. In view of the above, those skilled in the art will understand that, for portions of the disclosure that are not described in detail in one embodiment, reference may also be made to the related descriptions of other embodiments.
In particular implementations, based on the description and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not disclosed herein. For example, the units in the foregoing embodiments of the electronic device or apparatus are divided based on logical functions, and other divisions may be used in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. In terms of the connectivity between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when the aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, an optical disk, or other media capable of storing program code.
In other implementation scenarios, the integrated unit may also be implemented in the form of hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., the computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), such as a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, or a RAM.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (12)

1. A compilation method for optimizing a neural network model, wherein the neural network model includes a convolutional layer and a pooling layer connected to each other, the compilation method being performed by a general-purpose processor and comprising:
acquiring convolution parameters and weights of the convolution layer and pooling parameters of the pooling layer;
fusing the convolution parameters and the pooling parameters to obtain fusion parameters;
optimizing the neural network model according to the fusion parameters and the pooling parameters to convert the convolutional layer and the pooling layer into a fusion convolutional layer, wherein a fusion weight of the fusion convolutional layer is obtained by converting the weight of the convolutional layer using the fusion parameters and the pooling parameters; and
compiling the optimized neural network model into a corresponding binary instruction sequence for distribution to an artificial intelligence processor to execute a corresponding task.
2. The compilation method of claim 1, wherein dimensions of a convolution kernel of the convolutional layer and a pooling kernel of the pooling layer comprise one or more dimensions, wherein the convolution parameters comprise a size parameter and a step size parameter of the convolution kernel, the pooling parameters comprise a size parameter and a step size parameter of the pooling kernel, the fusion parameters comprise a size parameter and a step size parameter of a fusion convolution kernel, and wherein fusing the convolution and pooling parameters to obtain the fusion parameters comprises:
and respectively fusing the dimension parameters and the step length parameters of the convolution kernels in all dimension directions with the dimension parameters and the step length parameters of the pooling kernels in the corresponding dimension directions by taking the dimension directions as the reference so as to respectively obtain the dimension parameters and the step length parameters of the fused convolution kernels.
3. The compilation method of claim 2, wherein the dimension direction is a horizontal direction or a vertical direction, and wherein fusing the convolution parameters and the pooling parameters to obtain the fusion parameters comprises:
fusing the horizontal size parameter and step size parameter of the convolution kernel with the horizontal size parameter and step size parameter of the pooling kernel to obtain a size parameter and a step size parameter of the fused convolution kernel, respectively; and/or
fusing the vertical size parameter and step size parameter of the convolution kernel with the vertical size parameter and step size parameter of the pooling kernel to obtain a size parameter and a step size parameter of the fused convolution kernel, respectively.
4. The compilation method of any of claims 1-3 wherein the pooling layer comprises an average pooling layer or a sum pooling layer.
5. The compilation method of claim 4, wherein the pooling layer is the average pooling layer, and wherein converting the weight of the convolutional layer using the fusion parameters and the pooling parameters to obtain the fusion weight comprises:
stacking and summing the weights of the convolutional layer according to the fusion parameters; and
performing a pooling operation on the stacked and summed weights according to the pooling parameters to obtain the fusion weight of the fusion convolutional layer.
6. The compilation method of claim 2 or 5, wherein the size parameter of the convolution kernel is larger than the step size parameter of the convolution kernel.
7. The compilation method of claim 5, wherein the average pooling parameters of the pooling layer further comprise weights, and wherein stacking and summing the weights of the convolutional layer according to the fusion parameters comprises:
and stacking the weights of the convolutional layers according to the fusion parameters and the weights, and performing weighted summation.
8. A compiler for optimizing a neural network model, wherein the neural network model comprises a convolutional layer and a pooling layer connected to each other, the compiler comprising:
an obtaining module configured to obtain convolution parameters and weights of the convolutional layer and pooling parameters of the pooling layer;
a fusion module configured to fuse the convolution parameters and the pooling parameters to obtain fusion parameters;
an optimization module configured to optimize the neural network model according to the fusion parameters and the pooling parameters, so as to convert the convolutional layer and the pooling layer into a fusion convolutional layer, wherein a fusion weight of the fusion convolutional layer is obtained by converting the weight of the convolutional layer using the fusion parameters and the pooling parameters; and
a distribution module configured to compile the optimized neural network model into a corresponding binary instruction sequence for distribution to an artificial intelligence processor to execute a corresponding task.
9. An apparatus for optimizing a neural network model, comprising:
at least one processor; and
at least one memory for storing program instructions that, when loaded and executed by the at least one processor, cause the apparatus to perform the method of any of claims 1-7.
10. A computer program product comprising program instructions which, when executed by a processor, implement the compilation method of any one of claims 1-7.
11. A computing device comprising an artificial intelligence processor configured to execute a sequence of binary instructions compiled by the compilation method of any of claims 1-7.
12. A board comprising the computing device of claim 11.
CN202110729713.1A 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof Active CN113469337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729713.1A CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729713.1A CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Publications (2)

Publication Number Publication Date
CN113469337A true CN113469337A (en) 2021-10-01
CN113469337B CN113469337B (en) 2024-04-05

Family

ID=77873963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729713.1A Active CN113469337B (en) 2021-06-29 2021-06-29 Compiling method for optimizing neural network model and related products thereof

Country Status (1)

Country Link
CN (1) CN113469337B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180276527A1 (en) * 2017-03-23 2018-09-27 Hitachi, Ltd. Processing Method Using Convolutional Neural Network, Convolutional Neural Network Learning Method, and Processing Device Including Convolutional Neural Network
US20190303762A1 (en) * 2018-03-30 2019-10-03 Xilinx, Inc. Methods of optimization of computational graphs of neural networks
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109033953A (en) * 2018-06-14 2018-12-18 深圳市博威创盛科技有限公司 Training method, equipment and the storage medium of multi-task learning depth network
CN109409431A (en) * 2018-10-29 2019-03-01 吉林大学 Multisensor attitude data fusion method and system neural network based
US20210182682A1 (en) * 2018-12-29 2021-06-17 Cambricon Technologies Corporation Limited Learning task compiling method of artificial intelligence processor and related products
CN110191299A (en) * 2019-04-15 2019-08-30 浙江大学 A kind of multiplex frame interpolation method based on convolutional neural networks
WO2021054990A1 (en) * 2019-09-16 2021-03-25 Neuralmagic Inc. Systems and methods for generation of sparse code for convolutional neural networks
CN112330719A (en) * 2020-12-02 2021-02-05 东北大学 Deep learning target tracking method based on feature map segmentation and adaptive fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Fan; Deng Zuojie; Shang Shufei: "A multi-feature clothing image retrieval method based on convolutional neural network", Journal of Hunan Institute of Engineering (Natural Science Edition), no. 03 *
Kong Jun: "Biometric recognition based on two-layer feature fusion", Journal of Beihua University (Natural Science Edition), no. 01 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023116105A1 (en) * 2021-12-23 2023-06-29 北京地平线信息技术有限公司 Method and apparatus for compiling neural network model, and electronic device and storage medium

Also Published As

Publication number Publication date
CN113469337B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN111047022B (en) Computing device and related product
CN112633490B (en) Data processing device, method and related product for executing neural network model
WO2023045445A1 (en) Data processing device, data processing method, and related product
CN112686379A (en) Integrated circuit device, electronic equipment, board card and calculation method
CN113837922A (en) Computing device, data processing method and related product
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN111047021B (en) Computing device and related product
WO2022095675A1 (en) Neural network sparsification apparatus and method and related product
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
WO2022063217A1 (en) Device for forward fusion of neural network, board, method, and readable storage medium
WO2022135599A1 (en) Device, board and method for merging branch structures, and readable storage medium
CN115599738A (en) Method for optimizing neural network model and related product
WO2022063183A1 (en) Device and method for neural network computing, and board and readable storage medium
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
CN113792867B (en) Arithmetic circuit, chip and board card
CN113469333B (en) Artificial intelligence processor, method and related products for executing neural network model
WO2022135600A1 (en) Computational neural network apparatus, card, method, and readable storage medium
CN116385714A (en) Image processing method and related product
CN116090519A (en) Compiling method of convolution operator and related product
CN113791754A (en) Arithmetic circuit, chip and board card
CN113469328A (en) Device, board card, method and readable storage medium for executing revolution crossing
CN114692841A (en) Data processing device, data processing method and related product
CN116484926A (en) Self-adaptive splitting optimization equipment and method
CN113469327A (en) Integrated circuit device for executing advance of revolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant