CN115599738A - Method for optimizing neural network model and related product - Google Patents

Method for optimizing neural network model and related product

Info

Publication number
CN115599738A
CN115599738A CN202110781855.2A CN202110781855A
Authority
CN
China
Prior art keywords
data
chip
input
convolution
convolution kernel
Prior art date
Legal status
Pending
Application number
CN202110781855.2A
Other languages
Chinese (zh)
Inventor
Name not published at the inventor's request
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd
Priority to CN202110781855.2A
Publication of CN115599738A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The present disclosure relates to a method for optimizing a neural network model and a related product. The system on chip of the present disclosure may be included in a computing processing device of a combined processing device, and the combined processing device may further include a universal interconnect interface and other processing devices. The computing processing device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device may further include a storage device, which is connected to the computing processing device and the other processing devices, respectively, and is used for storing data of the computing processing device and the other processing devices. The scheme of the present disclosure can optimize convolution operations in a system on chip and improve operation performance.

Description

Method for optimizing neural network model and related product
Technical Field
The present disclosure relates generally to the field of artificial intelligence technology. More particularly, the present disclosure relates to a method for optimizing a neural network model running on a system on chip, an integrated circuit device, a board, an apparatus for optimizing a neural network model running on a system on chip, and a computer-readable storage medium.
Background
A convolution operation in a neural network model may involve computing a matrix inner product between an input feature map and a convolution kernel, where the number of positions by which the convolution kernel slides to the right or downward on the input feature map at each step may be regarded as the sliding step size (referred to as the stride) of the convolution kernel. In algorithmic terms, a convolution operation with a step size of 1 and a convolution kernel of size 1 × 1 is equivalent to a matrix multiplication operation. Such a convolution operation can therefore be replaced by a matrix multiplication operation, so that the correspondingly high efficiency of matrix multiplication can be applied to the convolution operation.
However, for a convolution operation with a convolution step size greater than 1 and a convolution kernel of size 1 × 1, the operation process is not equivalent to a matrix multiplication operation. Such a convolution operation therefore cannot be directly replaced by a matrix multiplication operation, and the corresponding efficiency gain of matrix multiplication cannot be obtained.
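The following NumPy sketch (an illustrative assumption, not part of the patent text) makes the background equivalence concrete: for a 1 × 1 kernel with a step size of 1, flattening the spatial positions of an NHWC input into rows turns the convolution into a single matrix multiplication.

```python
# Illustrative sketch only: array names, shapes, and the NHWC layout are assumptions
# chosen to mirror the discussion above, not the patent's implementation.
import numpy as np

N, H, W, Ci, Co = 1, 5, 6, 8, 4          # batch, height, width, input/output channels
x = np.random.rand(N, H, W, Ci)          # input feature map in NHWC layout
k = np.random.rand(Co, Ci)               # 1x1 convolution kernel, one row per output channel

# Direct 1x1 convolution with stride 1: a weighted sum over input channels at
# every spatial position.
y_conv = np.einsum('nhwc,oc->nhwo', x, k)

# Equivalent matrix multiplication: flatten the spatial positions into rows.
a = x.reshape(N * H * W, Ci)             # left matrix, (N*H*W) x Ci
y_mm = (a @ k.T).reshape(N, H, W, Co)    # right matrix is the 1x1 kernel

assert np.allclose(y_conv, y_mm)         # both paths give the same result
```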
Disclosure of Invention
In view of the technical problems mentioned in the background section above, the present disclosure provides a solution for optimizing a neural network model. With the disclosed scheme, the calculation process of the convolution operation can be improved so that a convolution operation with a convolution step size greater than 1 and a convolution kernel of size 1 × 1 can be converted into a matrix multiplication operation. The performance advantage of matrix multiplication can thereby be fully utilized during the operation of the neural network model, improving the performance of the system on chip when executing the operation. Based on this, the present disclosure provides a solution for optimizing a neural network model in the following aspects.
In a first aspect, the present disclosure provides a method for optimizing a neural network model running on a system on chip, wherein the neural network model operates using a convolution kernel having a size of 1 × 1 and a step size greater than 1, the method comprising: obtaining, from input data, valid data to be convolved with the convolution kernel; continuously storing the valid data in the system on chip so as to convert the convolution operation into a matrix multiplication operation; and performing the matrix multiplication operation between the valid data and the convolution kernel at the system on chip to obtain output data.
In a second aspect, the present disclosure provides a system on chip comprising a processor and a memory, the memory storing program instructions which, when executed by the processor, implement the method for optimizing a neural network model running on a system on chip according to the first aspect of the present disclosure.
In a third aspect, the present disclosure provides an integrated circuit device comprising a system on chip according to the second aspect of the present disclosure.
In a fourth aspect, the present disclosure provides a board comprising the integrated circuit device according to the third aspect of the present disclosure.
In a fifth aspect, the present disclosure provides an apparatus for optimizing a neural network model running on a system-on-chip, wherein the neural network model operates using a convolution kernel having a size of 1x1 and a step size greater than 1, the apparatus comprising: a processor; and a memory storing computer program code for performing the method for optimizing a neural network model running on a system on chip according to the first aspect of the present disclosure; a compiler that compiles the computer program code under control of the processor to generate a sequence of binary instructions for performing the method, wherein the sequence of binary instructions is adapted to be executed by an artificial intelligence processor.
In a sixth aspect, the present disclosure provides a computer readable storage medium comprising program instructions for optimizing a neural network model running on a system on chip, which when executed by a processor, implement the method for optimizing a neural network model running on a system on chip according to the first aspect of the present disclosure.
According to the solutions provided in the above aspects of the present disclosure, the calculation process of the convolution operation can be improved so that a convolution operation with a convolution step size greater than 1 and a convolution kernel of size 1 × 1 can be processed as a matrix multiplication. Through this conversion of the operation mode, the disclosed scheme can fully exploit the efficiency and convenience of matrix multiplication in the system on chip, thereby improving the operation efficiency and the overall performance of the system on chip.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. In the drawings, several embodiments of the disclosure are illustrated by way of example and not by way of limitation, and like or corresponding reference numerals indicate like or corresponding parts, in which:
fig. 1 is a block diagram illustrating a board card according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an integrated circuit device according to an embodiment of the disclosure;
fig. 3 is a schematic diagram illustrating an internal structure of a single-core computing device according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an internal architecture of a multi-core computing device according to an embodiment of the present disclosure;
FIG. 5 is an internal block diagram illustrating a processor core according to an embodiment of the disclosure;
FIG. 6 is a flow diagram illustrating a method for optimizing a neural network model in accordance with an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating one example of a convolution operation process according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating a method for converting a convolution operation to a matrix multiplication operation in accordance with an embodiment of the present disclosure;
FIG. 9 is a schematic diagram illustrating the finding of a gradient of a neuron, in accordance with an embodiment of the present disclosure; and
fig. 10 is a detailed flow diagram illustrating a method for optimizing a neural network model according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure. It is apparent that the described embodiments are some, but not all, embodiments of the present disclosure, and the described embodiments may be appropriately combined to implement different applications according to different scenarios. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," and "third," etc. as may be used in the claims, the description, and the drawings of the present disclosure, are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present disclosure. It should be understood that the configuration and composition shown in FIG. 1 is merely an example, and is not intended to limit the aspects of the present disclosure in any way.
As shown in fig. 1, the board card 10 includes a chip 101, which may be a system on chip (SoC), i.e., the system on chip described in the context of the present disclosure. In one application scenario, the system on chip here may include a processor and a memory, and the memory may be configured to store program instructions. Thus, when the program instructions stored in the aforementioned memory are executed by the processor, the matrix multiplication operation converted from the convolution operation, which will be discussed later in this disclosure, may be implemented.
In one implementation scenario, the chip 101 may be integrated with one or more combined processing devices. The combined processing device may be an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing and data mining, with deep learning technology being applied extensively in the field of cloud intelligence. A notable characteristic of cloud-based intelligent applications is the large volume of input data, which places high demands on the storage capacity and computing capacity of the platform. The board card 10 of this embodiment is suitable for cloud-based intelligent applications, having large off-chip storage, large on-chip storage and powerful computing capacity.
As further shown in the figure, the chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 may be, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like, according to different application scenarios. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 may also include a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 may be configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer, i.e., a micro controller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 according to the above-described embodiment. As shown in fig. 2, the combined processing device 20 may include a computing device 201, an interface device 202, a processing device 203, and a dynamic random access memory (DRAM) 204.
The computing device 201 may be configured to perform user-specified operations and is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor. In some operations it may be used to perform deep learning or machine learning computations, and it may also interact with the processing device 203 through the interface device 202 to jointly complete user-specified operations. In one implementation scenario, the computing device here may be configured to perform the convolution operation or matrix multiplication operation in the context of the present disclosure.
The interface device 202 may be used to transfer data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data (e.g., various types of data related to neural network operations in the context of the present disclosure) from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively or additionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU) or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., and the number thereof may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered alone, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together as integrated, the two are regarded as forming a heterogeneous multi-core structure. In some implementation scenarios, the processing device 203 in this heterogeneous multi-core structure may compile instruction code embodying a neural network model to form a binary instruction sequence executable by the computing device 201.
The DRAM 204 is used to store data to be processed and may be, in one implementation scenario, a DDR memory, which is typically 16 GB or larger in size, for storing data of the computing device 201 and/or the processing device 203.
Fig. 3 shows an internal structure diagram of the computing device 201 as a single core. The single-core computing device 301 is used for processing input data in fields such as computer vision, speech, natural language and data mining, and includes three modules: a control module 31, an operation module 32 and a storage module 33. The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete deep learning tasks, and includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 obtains instructions from the processing device 203, and the instruction decode unit 312 decodes the obtained instructions and sends the decoded results to the operation module 32 and the storage module 33 as control information. When implementing particular aspects of the present disclosure, the instructions here may be general convolution instructions for performing matrix multiplication or convolution operations.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 performs vector operations and can support complex operations such as vector multiplication, addition and nonlinear transformation; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., the matrix multiplication and convolution operations mentioned in the context of this disclosure. The storage module 33 is used to store or transfer related data and includes a neuron storage unit (Neuron RAM, NRAM) 331, a parameter storage unit (Weight RAM, WRAM) 332, and a direct memory access (DMA) unit 333. NRAM 331 is used to store input neurons, output neurons and intermediate results after computation; WRAM 332 is used to store the convolution kernels of the deep learning network, i.e., weights, such as the convolution kernels with a step size greater than 1 and a size of 1 × 1 in the present disclosure. The DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the single-core computing device 301 and the DRAM 204.
Fig. 4 shows a schematic diagram of the internal structure of the computing device 201 with multiple cores. The multi-core computing device 41 may adopt a hierarchical design and may operate as a system on chip, which may include at least one cluster according to the present disclosure, each cluster in turn including a plurality of processor cores. In other words, the multi-core computing device 41 is constructed in a system-on-chip/cluster/processor-core hierarchy. Looking at the system-on-chip level, as shown in fig. 4, the multi-core computing device 41 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and a plurality of clusters 405.
There may be a plurality (e.g., 2 as illustrated) of external memory controllers 401, which are configured to access an external memory device, i.e., an off-chip memory (e.g., the DRAM 204 in fig. 2) in the context of the present disclosure, in response to an access request issued by a processor core, to read data from or write data to the off-chip memory. The peripheral communication module 402 is used for receiving the control signal from the processing device 203 through the interface device 202 and starting the computing device 201 to execute the task. The on-chip interconnect module 403 connects the external memory controller 401, the peripheral communication module 402 and the plurality of clusters 405 for transmitting data and control signals between the respective modules. The synchronization module 404 is a global synchronization barrier controller (GBC) for coordinating the work progress of the clusters and ensuring information synchronization. The plurality of clusters 405 of the present disclosure are the computing cores of the multi-core computing device 41. Although 4 clusters are exemplarily shown in fig. 4, as hardware evolves, the multi-core computing device 41 of the present disclosure may also include 8, 16, 64, or even more clusters 405. In one application scenario, the cluster 405 may be used to efficiently execute a deep learning algorithm.
At the cluster level, as shown in FIG. 4, each cluster 405 may include a plurality of processor cores (IPU core) 406 and a memory core (MEM core) 407.
The processor cores 406 are exemplarily shown as 4 in the figure; the present disclosure does not limit the number of processor cores 406, and their internal architecture is shown in fig. 5. Each processor core 406 is similar to the single-core computing device 301 of fig. 3 and likewise includes three modules: a control module 51, an operation module 52 and a storage module 53. The functions and structures of the control module 51, the operation module 52 and the storage module 53 are substantially the same as those of the control module 31, the operation module 32 and the storage module 33, and are not described again here. It should be particularly noted that the storage module 53 may include an input/output direct memory access (IODMA) module 533 and a move direct memory access (MVDMA) module 534. The IODMA 533 controls access between the NRAM 531/WRAM 532 and the DRAM 204 through the broadcast bus 409; the MVDMA 534 is used to control access between the NRAM 531/WRAM 532 and the storage unit (SRAM) 408.
Returning to FIG. 4, the storage core 407 is mainly used for storage and communication, i.e., storing shared data or intermediate results among the processor cores 406, as well as performing communication between the cluster 405 and the DRAM 204, communication among the clusters 405, communication among the processor cores 406, and the like. In other embodiments, the storage core 407 may have scalar operation capability to perform scalar operations.
The storage core 407 may include a static random-access memory (SRAM) 408, a broadcast bus 409, a cluster direct memory access (CDMA) unit 410, and a global direct memory access (GDMA) unit 411. In one implementation scenario, the SRAM 408 may assume the role of a high-performance data relay station. Thus, data multiplexed between different processor cores 406 within the same cluster 405 need not be fetched from the DRAM 204 by each processor core 406 individually, but is instead relayed among the processor cores 406 via the SRAM 408. Further, the storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the plurality of processor cores 406, so that inter-core communication efficiency can be improved and off-chip input/output accesses can be significantly reduced.
The broadcast bus 409, the CDMA 410 and the GDMA 411 are used respectively to perform communication among the processor cores 406, communication among the clusters 405, and data transfer between the cluster 405 and the DRAM 204. These are described separately below.
The broadcast bus 409 is used to complete high-speed communication among the processor cores 406 in the cluster 405, and the broadcast bus 409 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (e.g., from a single processor core to a single processor core) data transfer, multicast is a communication that transfers a copy of data from SRAM 408 to a particular number of processor cores 406, and broadcast is a communication that transfers a copy of data from SRAM 408 to all processor cores 406, which is a special case of multicast.
The CDMA 410 is used to control access to the SRAM 408 between different clusters 405 within the same computing device 201. The GDMA 411 cooperates with the external memory controller 401 to control access from the SRAM 408 of the cluster 405 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 408. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved in two ways. In the first way, the DRAM 204 communicates directly with the NRAM 431 or WRAM 432 through the IODMA 433. In the second way, data is transferred between the DRAM 204 and the SRAM 408 through the GDMA 411, and between the SRAM 408 and the NRAM 431 or WRAM 432 through the MVDMA 534. Although the second way may require more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first way, so it may be more efficient to perform communication between the DRAM 204 and the NRAM 431 or WRAM 432 in the second way. It should be understood that the data transfer schemes described here are merely exemplary, and those skilled in the art can flexibly select and adapt various data transfer schemes according to the specific arrangement of the hardware in light of the teachings of the present disclosure.
In other embodiments, the functions of the GDMA 411 and the IODMA 533 may be integrated in the same component. Although the present disclosure treats the GDMA 411 and the IODMA 533 as different components for convenience of description, implementations by a person skilled in the art that achieve functions and technical effects similar to those of the present disclosure fall within its scope of protection. Further, the functions of the GDMA 411, the IODMA 533, the CDMA 410 and the MVDMA 534 may be implemented by the same component.
The hardware architecture of the present disclosure and its internal structure are described in detail above in conjunction with figs. 1-5. It is to be understood that the above description is intended to be illustrative, and not restrictive. According to different application scenarios and hardware specifications, those skilled in the art may also modify the board card and the internal structure of the present disclosure, and such modifications still fall within the protection scope of the present disclosure. Based on the above exemplary architecture, the scheme of the present disclosure proposes to optimize convolution operations, and in particular to convert a convolution operation with a convolution step size greater than 1 and a convolution kernel size of 1 × 1 into a matrix multiplication operation with higher execution efficiency, thereby accelerating the operation speed of the system on chip and significantly improving its operation performance.
FIG. 6 illustrates a flow diagram of a method 600 for optimizing a neural network model running on a system on chip, in accordance with an embodiment of the present disclosure. In this method, the neural network model may operate using a convolution kernel having a size of 1 × 1 and a step size greater than 1. Here, a convolution kernel having a size of 1 × 1 refers to a convolution kernel with 1 row and 1 column, and a step size greater than 1 means that the number of positions by which the 1 × 1 convolution kernel moves to the right or downward each time on the input feature map is greater than 1 (e.g., 2, 3, 4, etc.).
As shown in fig. 6, at step 601, valid data to be convolved with the convolution kernel is obtained from the input data. In one implementation scenario, the valid data to be convolved with the convolution kernel is the portion of the input data that actually participates in the convolution operation with the convolution kernel. To facilitate understanding of valid data and invalid data in the convolution operation, they are exemplified below in conjunction with the convolution operation shown in fig. 7. As known to those skilled in the art, the relevant layers (e.g., convolutional layers) in a neural network model may perform convolution operations. In actual practice, a convolution operation can be regarded as computing a matrix inner product between the input feature matrix (i.e., the input data in the present disclosure) and the convolution kernel.
As shown in the example of fig. 7, taking a convolutional layer as an example, the input feature matrix X of the convolutional layer (i.e., the input data in the present disclosure) is a 5 × 6 matrix (5 rows and 6 columns, i.e., rows 0 to 4 and columns 0 to 5) with the values shown in fig. 7; the convolution kernel K is a 1 × 1 matrix whose single weight K(0,0) is 2, and its sliding step (i.e., step size) on the input feature matrix X is set to 2; the output feature matrix Y of the convolutional layer is then a 3 × 3 matrix (3 rows and 3 columns, i.e., rows 0 to 2 and columns 0 to 2). To calculate the first value Y(0,0) of the output feature matrix Y, the convolution kernel K may be aligned to position (0, 0) (i.e., row 0, column 0) of the input feature matrix X. The input data at the corresponding position of X is then multiplied by the corresponding weight in the convolution kernel and summed (i.e., a weighted summation), giving Y(0,0) = X(0,0) × K(0,0) = 2. Similarly, to calculate the second value Y(0,1), the convolution kernel K is slid one step to the right in the input feature matrix X (i.e., two columns to the right, since the step size in this example is 2), so that K is aligned to position (0, 2) (i.e., row 0, column 2) of X. Multiplying the input data at the corresponding position of X by the corresponding weight of the convolution kernel and summing gives Y(0,1) = X(0,2) × K(0,0) = 2. In the same way, the fourth value of the output matrix Y can be calculated as Y(1,0) = X(2,0) × K(0,0) = 2; and so on, until the convolution result shown on the right side of fig. 7, namely the output feature matrix Y, is obtained.
From the convolution operation shown in fig. 7, it can be seen that the input data with value "1" in the input feature matrix X participates in the weighted summation with the convolution kernel K, while the input data with value "0" does not. Thus, in the example of fig. 7, the valid data of the input feature matrix X to be convolved with the convolution kernel K is the data at the following positions: X(0,0), X(0,2), X(0,4), X(2,0), X(2,2), X(2,4), X(4,0), X(4,2) and X(4,4). In contrast, the data at the remaining positions of the input feature matrix X is regarded as invalid data, i.e., the input data with value "0". It should be understood that representing valid data and invalid data by the specific values "1" and "0" in this example is merely exemplary and not restrictive. The valid data and invalid data may take values other than "1" or "0" depending on the actual operation scenario.
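As a hedged illustration (not part of the patent text), the following NumPy snippet reproduces the Fig. 7 arithmetic under the stated example values: valid positions of X hold 1, invalid positions hold 0, and the single kernel weight is 2.

```python
# Illustrative check of the Fig. 7 example described above; values are the
# example's assumptions ("1" at valid positions, "0" elsewhere, kernel weight 2).
import numpy as np

x = np.zeros((5, 6))
x[::2, ::2] = 1.0            # valid data at rows 0, 2, 4 and columns 0, 2, 4
k = 2.0                      # the single weight K(0,0) of the 1x1 kernel

# A 1x1 convolution with step size 2 samples X at the valid positions and scales by K.
y = x[::2, ::2] * k
print(y)                     # 3x3 output feature matrix, every entry equal to 2
```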
Based on the above exemplary description of valid data and invalid data, the present disclosure proposes to determine valid data and invalid data in input data according to a position index of the input data, a position index of output data as a result of an operation, and a step size of a convolution kernel. In one implementation scenario, the aforementioned location index may be a location parameter used to indicate where the data is located. Still taking the example shown in fig. 7, the position parameter in the input feature matrix X may be a specific position in the matrix, such as (0, 2) (i.e. row 0, column 2). Similarly, the location parameter in the output characteristic matrix Y may be a parameter indicating a specific location of the output data in the matrix, such as (0, 0) (i.e., row 0, column 0).
In one implementation scenario, the input data (e.g., input neuron data) and the output data (e.g., output neuron data) of the present disclosure may each be four-dimensional tensor data with the common NHWC data layout, where N denotes the number of batches processed (batch size), H the height of the tensor data, W the width of the tensor data, and C the number of channels of the tensor data. Specifically, the input data may be a four-dimensional tensor in the format NHiWiCi, and the output data a four-dimensional tensor in the format NHoWoCo, where "i" denotes input and "o" denotes output. On the basis of this data format, the aforementioned position index of the present disclosure may include the sequence numbers of the input data and the output data in the row and column dimensions, i.e., a row number in the row dimension and a column number in the column dimension. In this regard, when the row and column dimensions are represented, for example, by Hi × Wi as shown in fig. 8, Hi_idx may be used to represent the input data row number and Wi_idx the input data column number. Likewise, Ho_idx may be used to represent the output data row number and Wo_idx the output data column number. Similarly, the step size (stride) of the convolution kernel may include a height value and a width value, where the height value may be denoted as "stride_h" and the width value as "stride_w".
Based on the data layout format and data representation described above, in one implementation scenario, the valid data to be convolved with the convolution kernel in the input data of the present disclosure may include the input data located in the valid data rows and valid data columns. In other words, the data lying in both a valid data row and a valid data column is the valid data for performing the convolution operation, i.e., the valid data participating in the matrix multiplication operation.
Considering that the size of the input data is known before the operation, and since the convolution kernel of the present disclosure has a size of 1 × 1 and a step size greater than 1, the size information of the output data (whose size is determined by the convolution kernel moving over the input data), including its row and column numbers, can be known in advance from the size information of the input data and the attributes of the convolution kernel. In addition, in view of the correspondence between the row and column numbers of the output data and the row and column numbers of the data in the input data that actually participates in the convolution operation (i.e., the valid data), the present disclosure proposes to determine the aforementioned valid data rows by the following formulas:
Hi_idx=Ho_idx*stride_h (1)
Hi_idx≠Ho_idx*stride_h (2)
When the input data row number satisfies formula (1), it can be determined that the input data is in a valid data row. Conversely, when the input data row number satisfies formula (2), it can be determined that the input data is in an invalid data row.
Similar to the above equations (1) and (2), the valid data column may be determined based on the following equations (3) and (4):
Wi_idx=Wo_idx*stride_w (3)
Wi_idx≠Wo_idx*stride_w (4)
When the input data column number satisfies formula (3), it can be determined that the input data is in a valid data column. Conversely, when the input data column number satisfies formula (4), it can be determined that the input data is in an invalid data column. After the valid data rows and valid data columns are determined, the valid data in the input data can be determined from the valid row and column numbers.
The following schematically describes, taking the input feature data X and the output feature data Y shown in fig. 7 as an example, how to determine the valid data rows and invalid data rows in the input feature data X by using the row and column dimension information of the output feature data Y. Specifically, for the output feature data Y, the output data size of 3 × 3 (i.e., rows 0 to 2 and columns 0 to 2) can be determined in advance based on the input data size (5 × 6, i.e., rows 0 to 4 and columns 0 to 5 in fig. 7) and the convolution kernel attributes (step size of 2 and size of 1 × 1), so that the output data row numbers are 0, 1, and 2. Therefore, rows 0, 2, and 4 of the input feature data X can be determined as valid data rows based on the formula Hi_idx = Ho_idx * stride_h (where stride_h = 2). Similarly, since the column numbers of the output feature data Y are 0, 1, and 2, columns 0, 2, and 4 of the input feature data X can be determined as valid data columns based on the formula Wi_idx = Wo_idx * stride_w. Finally, the valid data in the input data X, that is, the data at positions (0, 0), (0, 2), (0, 4), (2, 0), (2, 2), (2, 4), (4, 0), (4, 2), and (4, 4) in the input data X, can be determined from the valid data rows and valid data columns.
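To make the index arithmetic above concrete, the following sketch (an illustrative assumption, not the patent's code) applies formulas (1)-(4): an input row Hi_idx is valid exactly when it equals Ho_idx * stride_h for some output row number Ho_idx, and likewise for columns.

```python
# Illustrative helper implementing the valid-row/valid-column test of
# formulas (1)-(4); function and parameter names are assumptions.
def valid_indices(in_size: int, out_size: int, stride: int) -> list:
    """Return the input row (or column) numbers that hold valid data."""
    return [o_idx * stride for o_idx in range(out_size) if o_idx * stride < in_size]

# Figure 7 geometry: 5x6 input, 1x1 kernel, step size 2, hence a 3x3 output.
valid_rows = valid_indices(in_size=5, out_size=3, stride=2)   # [0, 2, 4]
valid_cols = valid_indices(in_size=6, out_size=3, stride=2)   # [0, 2, 4]
valid_positions = [(r, c) for r in valid_rows for c in valid_cols]
# -> (0,0), (0,2), (0,4), (2,0), (2,2), (2,4), (4,0), (4,2), (4,4)
```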
The valid data of the present disclosure is described in detail above. Returning to the flow of fig. 6, after the valid data to be convolved with the convolution kernel is obtained from the input data, the flow proceeds to step 602. At step 602, the valid data may be stored continuously in the system on chip so as to convert the convolution operation into a matrix multiplication operation. The present disclosure proposes to store the valid data continuously at the system on chip in view of the following factors: 1) the presence of invalid data in the input data makes the valid data non-continuous in storage; 2) matrix multiplication requires that the two matrices participating in the operation be stored continuously. In view of these two factors, the present disclosure proposes to store the valid data continuously, so that the effect of the invalid data on the continuity of the valid data is eliminated and the convolution operation between the input data and the convolution kernel can be converted into a matrix multiplication operation between the valid data and the convolution kernel. How valid data is extracted from the input data for continuous storage is described in detail below in conjunction with fig. 8.
Fig. 8 is a schematic diagram illustrating a method for converting a convolution operation to a matrix multiplication operation according to an embodiment of the present disclosure. Specifically, two embodiments of continuously storing valid data are schematically illustrated in fig. 8.
The first embodiment:
As shown on the left side of fig. 8, the input data is a four-dimensional matrix composed of four-dimensional tensor data, where Hi represents the row dimension of the input data (5 rows as shown in the figure), Wi represents the column dimension of the input data (4 columns as shown in the figure), and Ci represents the number of channels of the input data; the "N" dimension described above is omitted here only to simplify the explanation. In one implementation scenario, the input data here may be stored in the off-chip system (e.g., the dynamic random access memory DRAM 204 shown in fig. 2).
For ease of understanding, the four-dimensional matrix is unfolded into two dimensions to obtain the 20 rows of input data shown in the Hi × Wi dimension in the middle of fig. 8, where each row of data corresponds to one matrix element of the left matrix. In other words, each element of the left matrix in fig. 8 is taken in top-to-bottom, left-to-right order and laid out along the Ci dimension, yielding the data arranged in rows 1 to 20 as shown in the middle of fig. 8 (for example, rows 1 to 3 as shown in the figure, and so on up to row 20).
As can be seen from the foregoing description of how to determine the valid data of the input data, for a convolution operation with a convolution kernel of size 1 × 1 and a step size greater than 1 (e.g., a step size of 2), if the input data of the convolution operation has a size of 5 × 4 (i.e., 5 rows and 4 columns), the output data has a size of 3 × 2 (i.e., 3 rows and 2 columns). It can then be determined by the foregoing formulas (1) to (4) that the valid data of the input data on the left side of fig. 8 is the data at the following positions: (row 0, column 0), (row 0, column 2), (row 2, column 0), (row 2, column 2), (row 4, column 0) and (row 4, column 2) (i.e., the gray squares in the figure), whereas the input data at the remaining positions is invalid data (i.e., the white squares in the figure). Correspondingly, in the input data shown in the two-dimensional view, the data in rows 1, 3, 9, 11, 17 and 19 is valid data (the gray rows in the figure), and the data in the remaining rows, such as rows 2 and 20, is invalid data (the white rows in the figure).
After the valid data and the invalid data are determined as above, the present disclosure proposes to selectively read only the valid data from the off-chip system storing the input data and to store the read valid data continuously in the on-chip system for the matrix multiplication operation. In the example shown in fig. 8, the valid data in rows 1, 3, 9, 11, 17 and 19 of the input data in the two-dimensional view can be selectively read into the on-chip system (as shown by "GDRAM2NRAM" and the arrow at 801 in fig. 8, where "GDRAM" denotes a memory of the off-chip system, "2" stands for "to", and NRAM denotes a memory of the on-chip system, such as the NRAM 331 shown in fig. 3), so as to form continuously stored data in the on-chip system, i.e., the A matrix shown in the figure.
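The following sketch (illustrative only; NumPy slicing stands in for the selective GDRAM-to-NRAM reads actually performed by DMA hardware, and all names are assumptions) shows how the valid rows and columns of an NHWC input are gathered into a continuously stored A matrix of shape M × Ci.

```python
# Illustrative gather of valid data into a contiguous left matrix for the
# matrix multiplication; the real data path would be the off-chip-to-on-chip DMA.
import numpy as np

def gather_valid_to_a_matrix(x_nhwc, stride_h, stride_w):
    """Gather the valid data of a 1x1, stride>1 convolution into an M x Ci matrix."""
    n, hi, wi, ci = x_nhwc.shape
    valid = x_nhwc[:, ::stride_h, ::stride_w, :]      # valid rows/columns only (a strided view)
    ho, wo = valid.shape[1], valid.shape[2]
    return np.ascontiguousarray(valid).reshape(n * ho * wo, ci)  # continuously stored A matrix

# Example matching Fig. 8: a 5x4 input with step size 2 gives a 3x2 output, so A is 6 x Ci.
x = np.random.rand(1, 5, 4, 8)
a_matrix = gather_valid_to_a_matrix(x, stride_h=2, stride_w=2)
print(a_matrix.shape)                                 # (6, 8)
```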
The second embodiment:
Unlike the first embodiment, in which valid data is selectively read from the off-chip system to the on-chip system, in the second embodiment the present disclosure proposes to read the input data directly into the on-chip system in the format shown in the two-dimensional view, as indicated by "GDRAM2NRAM" and the arrow at 803. In one implementation scenario, during this direct read, the present disclosure loads only a portion of the input data to the on-chip system, namely the portion associated with the valid data, which also includes some invalid data. Specifically, the present disclosure proposes to load a valid data row (including the invalid data within that row) to the on-chip system. For example, when row 0 of the input data is determined to be a valid data row, the data in rows 1 to 4 of the corresponding two-dimensional view (i.e., row 0 of the left matrix) may be loaded to the on-chip system, where rows 2 and 4 are the aforementioned invalid data. Similarly, the data corresponding to the valid data row 2 of the left matrix (i.e., rows 9 to 12 in the two-dimensional view) is loaded to the on-chip system, where rows 9 and 11 are valid data and rows 10 and 12 are invalid data; and so on, until the loading of the valid data rows to the on-chip system is complete. Next, a data move operation (i.e., "MOVE" shown at 805 in fig. 8) may be performed directly on the input data at the on-chip system so as to store continuously the valid data, including the data indicated by rows 1, 3, 9 (row numbers in the two-dimensional view) and so on in the lower-right matrix of fig. 8, thereby obtaining the A matrix. In one scenario, the aforementioned read and data move operations may be performed in parallel or in a pipelined manner. In particular, the on-chip system may be provided with two memory areas; in operation, one memory area receives a row of data from the off-chip system (i.e., a valid or invalid data row), while the other memory area participates in the data move operation (i.e., retaining the valid data rows and discarding the invalid data rows). To speed up the move operation, the two memory areas may alternate between receiving and moving, i.e., the memory area that has just received data subsequently participates in the move operation, while the memory area that has just participated in the move operation is used to receive the next valid or invalid data row from the off-chip system.
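A minimal sketch of this second embodiment follows (an assumption for illustration; the slicing operations stand in for the GDRAM2NRAM load and the on-chip MOVE, and the double-buffered pipelining described above is not modeled).

```python
# Illustrative two-step version: load whole valid data rows on-chip first,
# then compact the valid columns with a MOVE-like step.
import numpy as np

def load_then_move(x_nhwc, stride_h, stride_w):
    n, hi, wi, ci = x_nhwc.shape
    # Step 1 ("GDRAM2NRAM"): load only the valid data rows, invalid columns included.
    on_chip = x_nhwc[:, ::stride_h, :, :]             # shape (n, Ho, Wi, Ci)
    # Step 2 ("MOVE"): keep the valid columns and discard the invalid ones.
    compact = on_chip[:, :, ::stride_w, :]            # shape (n, Ho, Wo, Ci)
    ho, wo = compact.shape[1], compact.shape[2]
    return np.ascontiguousarray(compact).reshape(n * ho * wo, ci)  # A matrix, M x Ci

x = np.random.rand(1, 5, 4, 8)
print(load_then_move(x, 2, 2).shape)                  # (6, 8), same A matrix as before
```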
In some implementation scenarios, the input data is tensor data that includes at least an input channel dimension. On this basis, and considering the bandwidth when input data is transferred between the off-chip system and the on-chip system, the present disclosure proposes to choose, according to a comparison between the size of the input channel dimension (i.e., Ci shown in fig. 8) and a threshold, which of the above ways (i.e., the first or second embodiment) is used to store the valid data continuously in the on-chip system, so as to make full use of the bandwidth during data transfer. Specifically, the scheme of the present disclosure proposes to compare the number of channels of the input data with a preset threshold. When the number of channels of the input data is greater than the preset threshold, the valid data is stored continuously in the on-chip system according to the first embodiment. Conversely, when the number of channels of the input data is less than or equal to the preset threshold, the valid data is stored continuously in the on-chip system according to the second embodiment.
In one implementation scenario, the above threshold may be set according to one or more of the convolution kernel step size, the size of the input channel dimension, and the data bit width of the input data. On this basis, the threshold may be expressed, for example, as 256/(stride_w × sizeof(input_type)), where "sizeof(input_type)" denotes the bit width of the data type of the input data; for example, a 32-bit floating point number has a bit width of 32.
Based on the above threshold expression, whether the first or the second embodiment is used to continuously store the valid data can be decided according to the following inequality conditions, thereby improving the bandwidth utilization when data is read from off-chip to on-chip:
when Ci > 256/(stride_w × sizeof(input_type)), the "first embodiment" described above may be used to continuously store the valid data;
when Ci ≤ 256/(stride_w × sizeof(input_type)), the "second embodiment" described above may be used to continuously store the valid data.
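As a hedged illustration of this selection rule (function and parameter names are assumptions; sizeof(input_type) is taken as the bit width of the input data type, as stated above):

```python
# Illustrative channel-count test choosing between the two storage embodiments.
def use_selective_read(ci, stride_w, sizeof_input_type):
    """True -> first embodiment (selective read); False -> second embodiment (load then move)."""
    threshold = 256 / (stride_w * sizeof_input_type)
    return ci > threshold

print(use_selective_read(ci=64, stride_w=2, sizeof_input_type=32))  # True: 64 > 4
print(use_selective_read(ci=2,  stride_w=2, sizeof_input_type=32))  # False: 2 <= 4
```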
The continuous storage of the present disclosure has been described above in connection with fig. 8; we now return to the flow in fig. 6. After the valid data in the input data is continuously stored in the system on chip, the flow proceeds to step 603. At step 603, a matrix multiplication operation between the valid data and the convolution kernel may be performed at the system on chip to obtain the output data. Taking the example shown in fig. 8, after the valid data in the input data is continuously stored, the input left matrix of the matrix multiplication operation, i.e., the A matrix of size M × Ci in fig. 8, is obtained. It will be appreciated that the storage space of this input left matrix is now continuous. In addition, since the convolution kernel performing the convolution operation in this embodiment has a size of 1 × 1, the storage space of the input right matrix of the matrix multiplication operation (i.e., the B matrix of size N × Ci in fig. 8) is also continuous. Thus, the scheme of the present disclosure successfully converts the convolution operation between the input data and the convolution kernel into a matrix multiplication operation between the valid data on the system on chip and the convolution kernel (as shown by "Conv" in fig. 8), thereby obtaining a continuously stored output matrix, that is, the output data of the present disclosure.
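The following end-to-end check (illustrative NumPy, not the on-chip kernels; the B matrix here is assumed to hold one row of Ci weights per output channel) verifies that the matrix product of the continuously stored A matrix with the 1 × 1 kernel reproduces the strided convolution result.

```python
# Illustrative equivalence check between the strided 1x1 convolution and the
# converted matrix multiplication on continuously stored valid data.
import numpy as np

n, hi, wi, ci, co, stride = 1, 5, 4, 8, 3, 2
x = np.random.rand(n, hi, wi, ci)                 # input data, NHWC
b = np.random.rand(co, ci)                        # B matrix: 1x1 kernel weights

# Reference result: direct 1x1 convolution with step size 2.
y_ref = np.einsum('nhwc,oc->nhwo', x[:, ::stride, ::stride, :], b)

# Converted form: gather the valid data into the A matrix, then one matrix multiplication.
a = np.ascontiguousarray(x[:, ::stride, ::stride, :]).reshape(-1, ci)   # M x Ci
y_mm = (a @ b.T).reshape(y_ref.shape)

assert np.allclose(y_ref, y_mm)                   # identical output data
```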
The scheme of converting the convolution operation between the input data and the convolution kernel into a matrix multiplication operation has been described above with reference to figs. 6 to 8. It can be understood that the scheme of the present disclosure can be applied in the context of various convolution operations of neural network models. Specifically, the operation of a neural network model may include convolution operations in forward propagation and/or backward propagation, so the present disclosure may convert the convolution operations in forward propagation and/or backward propagation into matrix multiplication operations between valid data and the convolution kernel by continuously storing the valid input data. As for the application of the present scheme to forward propagation, the description given above with reference to fig. 8 applies, so it is not repeated here. The application of the present scheme to the backward propagation of a neural network model is described below with reference to fig. 9.
As known to those skilled in the art, a neural network model typically includes an input layer, an output layer, and one or more hidden layers located between them, such as convolutional layers, pooling layers, activation layers and fully connected layers. In the backward propagation of the neural network model, for each layer, the input neuron gradient vector of the current layer (i.e., the input gradient data of the present disclosure) is first weighted and summed to calculate the output neuron gradient vector of the current layer (also referred to as the output gradient data). This output neuron gradient vector is then multiplied by the derivative value of the activation function of the next layer in the backward propagation direction, as used in the forward propagation operation, to obtain the input neuron gradient vector of the previous layer. In addition, the input neuron gradient vector of the current layer is multiplied element-wise by the input neuron data from the forward propagation operation to obtain the gradient of the weights of the current layer. Finally, the weights of the current layer can be updated according to the obtained gradient of the current layer's weights, thereby completing the weight update of the current layer.
In the application example shown in fig. 9, in combination with the above description, the A matrix may represent the input gradient data passed to the current layer from the previous layer along the backward propagation direction, the B matrix may represent the convolution kernel, and the C matrix may represent the result matrix obtained by performing the matrix multiplication operation on the A matrix and the B matrix, which is the output gradient data passed from the current layer to the next layer along the backward propagation direction (when that next layer becomes the current layer, this output gradient data serves as its input gradient data). In one embodiment, the A matrix may be a matrix containing only valid data, obtained by applying the continuous storage scheme described above to input gradient data that contains both valid and invalid data. Correspondingly, the B matrix may be weight data loaded from the off-chip system to the on-chip system for residency. Since both the A matrix and the B matrix are stored continuously in the system on chip and the step size is greater than 1 (for example, the case of step size 2 described above), the convolution operation between the A matrix and the B matrix can be converted into a matrix multiplication operation, and the resulting output gradient data (i.e., the C matrix) is valid data stored continuously as described above.
As described above, in order to obtain the gradient of the current layer's weights, the input neuron gradient vector of the current layer needs to be combined (e.g., multiplied element-wise) with the input neuron data from the forward propagation operation. In view of this, when the C matrix in fig. 9 is used as the input neuron gradient vector of the current layer, it needs to be converted so that its size is consistent with the size of the input neurons in the forward propagation operation. To this end, the present disclosure proposes to perform a move operation, similar to the foregoing, on the C matrix so that the input neuron gradient data has the same size as the input neuron data in the forward propagation operation.
First, using the scheme described above in connection with figs. 6-8, the valid data and invalid data of the input neuron data in the forward propagation of the current layer, i.e., the valid and invalid data rows in the input neuron data, can be determined. Then, a corresponding shift of the output gradient data may be performed according to the aforementioned row and column dimensions of the valid data and invalid data, so as to move the output gradient data to the row and column dimensions corresponding to the valid data and to insert invalid data at the dimensions corresponding to the invalid data. Specifically, in the example of fig. 9, by performing a MOVE operation on the C matrix, the data in the C matrix is placed at the positions of the valid data rows, so that the valid data rows are spaced apart (as in the matrix obtained after the "MOVE" operation shown in fig. 9). Next, an operation of loading the valid data rows from the on-chip system to the off-chip system may be performed (shown as "NRAM2GDRAM" in the figure). For invalid data rows, invalid data (e.g., zeros) may be inserted directly during the load to the off-chip system. Through such move and insert operations, the "Diff_X" matrix shown in fig. 9, i.e., input neuron gradient vectors having the same size as the input neuron data, can be obtained.
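A compact sketch of this move-and-zero-fill step is given below (illustrative only; the stride value, shapes, and zero-filling of invalid rows are assumptions consistent with the stride-2 example above):

```python
import numpy as np

# Illustrative sketch of the MOVE / insert step: scatter the compact C matrix
# back to the valid-row positions of the forward-pass input and zero-fill the
# invalid rows so the gradient matches the input neuron data in size.
def scatter_gradient(c, in_h, in_w, c_in, stride=2):
    # c: compact output gradient of shape (out_h, out_w, c_in)
    diff_x = np.zeros((in_h, in_w, c_in), dtype=c.dtype)   # invalid data = zeros
    diff_x[::stride, ::stride, :] = c                      # "MOVE" to valid rows
    return diff_x                                          # the "Diff_X" tensor
```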
Fig. 10 is a detailed flow diagram illustrating a method 1000 for optimizing a neural network model in accordance with an embodiment of the present disclosure. From the foregoing description, those skilled in the art will appreciate that the flow shown in fig. 10 is one possible implementation of the scheme shown in fig. 6 and encompasses what is discussed in figs. 7-9. Therefore, operations that are the same as those described above are only briefly described below and are not repeated in detail.
As shown, at step 1001, convolution kernel data is loaded from the off-chip system to the on-chip system. From the foregoing, the convolution kernel data here has a size of 1 × 1 and a step size of 2 or more. Next, at step 1002, input neurons are obtained, and at step 1003, valid data and invalid data in the input neurons are determined. As described above, the valid data is data that actually participates in the convolution operation, while the invalid data does not participate in the convolution operation and causes discontinuity in the storage of the valid data. Thereafter, it is determined at step 1004 whether the size of the input channel dimension of the input neurons is greater than a threshold (e.g., 256/(stride_w × sizeof(input_type))).
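For illustration only (the constant 256 and the names stride_w and input_type come from the example formula above; treating sizeof as the data width in bytes is an assumption), the threshold test at step 1004 could be sketched as:

```python
# Illustrative sketch of the step-1004 threshold test.
def exceeds_threshold(c_in, stride_w, input_dtype_bytes):
    threshold = 256 / (stride_w * input_dtype_bytes)   # e.g. 256 / (2 * 4) = 32 for fp32
    return c_in > threshold
```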
If the size of the input channel dimension of the input neurons is greater than the threshold, the flow proceeds to step 1005. At step 1005, the valid data in the input neurons is selectively read from the off-chip system, and at step 1006 the valid data is continuously stored in the on-chip system, i.e., the operation at 801 in fig. 8 above. In contrast, when the size of the input channel dimension of the input neurons is less than or equal to the threshold, the flow proceeds to step 1007. At step 1007, input data is read from the off-chip system to the on-chip system. Next, at step 1008, a data movement operation (e.g., deleting invalid data in the input data) is performed based on the valid data and invalid data in the input data to obtain continuously stored valid data. After the continuously stored valid data is obtained, at step 1009, a matrix multiplication operation is performed on the continuously stored valid data and the convolution kernel data, thereby converting the convolution operation into a matrix multiplication operation.
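The two branches can be sketched roughly as follows (illustrative only; the stride-2 slicing stands in for the selective-read and on-chip move operations, whose actual hardware primitives are not specified in the text):

```python
import numpy as np

# Branch of steps 1005-1006: read only the valid rows/columns from off-chip
# memory so that they arrive already contiguous on chip.
def selective_read(x_off_chip, stride=2):
    return np.ascontiguousarray(x_off_chip[::stride, ::stride, :])

# Branch of steps 1007-1008: load a block as-is, then delete the invalid rows
# on chip so the valid data ends up stored contiguously for the matmul.
def load_then_compact(x_block, stride=2):
    valid = x_block[::stride, ::stride, :]
    return valid.reshape(-1, x_block.shape[2])
```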
In a back propagation corresponding to the forward propagation described above, at step 1101, convolution kernel data may be loaded to the on-chip system. Next, at step 1102, input gradient data of the previous layer in the back propagation direction (i.e., the input neuron gradient vector described in connection with fig. 9) is obtained. Next, given that both are data stored continuously in the on-chip system, a matrix multiplication operation of the input gradient data and the convolution kernel data is performed at step 1103, so that output gradient data of the current layer is obtained at step 1104. Next, in order to obtain the gradient of the weights so as to update the weights, at step 1105 a corresponding shift operation is performed on the output gradient data according to the row-column dimensions of the valid data and invalid data of the input neuron data in the forward propagation of the current layer, so that the input neuron gradient data has the same size as the input neuron data at the time of the forward propagation operation, thereby meeting the requirement of the weight gradient calculation (e.g., element-wise multiplication of the input neuron gradient data with the input neuron data).
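Tying the sketches above together, a hypothetical end-to-end backward pass for one layer (the shapes, stride, and reuse of scatter_gradient from the earlier sketch are all illustrative assumptions) might look like:

```python
import numpy as np

# Illustrative walk-through of steps 1101-1105, reusing scatter_gradient above.
c_out, c_in, stride = 8, 16, 2
w = np.random.randn(c_in, c_out).astype(np.float32)        # step 1101: resident kernel
grad_in = np.random.randn(4, 4, c_out).astype(np.float32)  # step 1102: compact input gradient
grad_out = grad_in.reshape(-1, c_out) @ w.T                 # step 1103: matmul
grad_out = grad_out.reshape(4, 4, c_in)                     # step 1104: output gradient
diff_x = scatter_gradient(grad_out, in_h=8, in_w=8, c_in=c_in, stride=stride)  # step 1105
```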
The aspects of the present disclosure are described in detail above with reference to the drawings. According to various application scenarios, an electronic device or apparatus of the present disclosure may include a processor, a memory and a compiler, where the memory stores computer program code for a method of optimizing a neural network model running on a system-on-chip of the present disclosure. Correspondingly, the aforementioned compiler compiles the computer program code under control of the processor to generate a sequence of binary instructions for performing the method, wherein the sequence of binary instructions is adapted to be executed by the artificial intelligence processor.
Further, the electronic device or apparatus of the present disclosure may also include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet-of-Things terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance scanner, a B-mode ultrasound scanner, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and terminals. In one or more embodiments, an electronic device or apparatus with high computing power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while an electronic device or apparatus with less computing power may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device, in order to simulate the hardware resources of the terminal device and/or the edge device and thereby complete unified management, scheduling, and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, it will be appreciated by those skilled in the art, in light of the disclosure or teachings of the present disclosure, that certain steps therein may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be regarded as optional embodiments, in that the acts or modules involved are not necessarily required to practice one or more aspects of the disclosure. In addition, depending on the solution, the present disclosure may focus its description on certain embodiments. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be found in the description of other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed herein may also be implemented in ways not described herein. For example, the units in the foregoing embodiments of the electronic device or apparatus are divided based on their logical functions, and other division manners are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connection relationships between different units or components discussed above in connection with the figures, these connections may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present disclosure are embodied in the form of a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions for causing a computer device (e.g., a personal computer, a server, or a network device, etc.) to perform some or all of the steps of the methods described in embodiments of the present disclosure. The aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code. Further, aspects of the present disclosure may be embodied in a computer-readable storage medium, which may include program instructions for optimizing a neural network model running on a system-on-chip. The program instructions, when executed by a processor, may implement the optimization method described previously in this disclosure.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as CPUs, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including a magnetic storage medium or a magneto-optical storage medium, etc.), and may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause a1. A method for optimizing a neural network model running on a system-on-chip, wherein the neural network model operates using a convolution kernel having a size of 1x1 and a step size greater than 1, the method comprising:
obtaining, from input data, valid data to be subjected to a convolution operation with the convolution kernel;
continuously storing the valid data on the system-on-chip so as to convert the convolution operation into a matrix multiplication operation; and
performing, at the system-on-chip, the matrix multiplication operation between the valid data and the convolution kernel to obtain output data.
Clause a2. The method of clause A1, wherein obtaining valid data of the input data to be convolved with the convolution kernel comprises:
determining valid data and invalid data in the input data according to the position index of the input data, the position index of the output data serving as an operation result, and the step size of the convolution kernel, wherein the invalid data is data in the input data that does not participate in the convolution operation.
Clause a3. The method of clause A2, wherein the input data and the output data are both four-dimensional tensor data, wherein the position index comprises the ordinal number of the input data and the output data in the row and column dimensions, and the step size of the convolution kernel comprises a height value and a width value of the step.
Clause a4. The method of clause A1, wherein the input data is tensor data including at least input channel dimensions, wherein the continuously storing valid data in the input data on a system-on-chip comprises:
selecting to continuously store the valid data on the system-on-chip in different ways according to comparison of the size of the input channel dimension with a threshold.
Clause a5. The method of clause A4, wherein the size of the input channel dimension is greater than the threshold, and wherein continuously storing valid data in the input data on a system-on-chip comprises:
selectively reading only the valid data from an off-chip system storing the input data; and
continuously storing the read valid data in the system-on-chip for the matrix multiplication operation.
Clause a6. The method of clause A4, wherein the size of the input channel dimension is less than or equal to the threshold, and wherein continuously storing the valid data on the system-on-chip comprises:
reading a portion of input data associated with the valid data from an off-chip system to the on-chip system; and
performing, at the system-on-chip, a data movement operation on the portion of the input data so as to remove the invalid data in the portion of the input data and cause the valid data to be stored continuously.
Clause a7. The method of clause A4, wherein the threshold value is determined according to one or more of the following:
the size of the step size of the convolution kernel;
a size of the input channel dimension; and
a data bit width of the input data.
Clause a8. The method according to any of clauses A1-A7, wherein the operation of the neural network model comprises a convolution operation in forward propagation and/or backward propagation, wherein the method comprises:
converting, through the continuous storage, the convolution operation in the forward propagation and/or the backward propagation into a matrix multiplication operation between the valid data and the convolution kernel.
Clause a9. The method of clause A8, wherein when the operation of the neural network model is a convolution operation in the forward propagation, the input data comprises input neuron data and the output data comprises output neuron data.
Clause a10. The method according to clause A8 or A9, wherein when the operation of the neural network model is a convolution operation in the back propagation, the input data includes input gradient data input from a previous layer in the back propagation direction to a current layer, and the output data includes output gradient data output from the current layer to a next layer in the back propagation direction.
Clause a11. The method of clause a10, wherein for the output gradient data as a result of a matrix multiplication operation, the method further comprises:
performing a corresponding movement of output gradient data according to row-column dimensions of valid data and invalid data of input neuron data of the current layer in forward propagation so as to move the output gradient data to row-column dimensions corresponding to the valid data and insert invalid data at dimensions corresponding to the invalid data.
Clause a12. The method of clause A8, further comprising:
reading convolution kernel data associated with the convolution kernel from an off-chip system and keeping it resident on the on-chip system for use in the matrix multiplication operation with the valid data.
Clause a13. A system on a chip comprising a processor and a memory, and the memory storing program instructions which, when executed by the processor, implement the method according to any one of clauses A1-a 12.
Clause a14. An integrated circuit device comprising the system-on-chip according to clause a13.
Clause a15. A board card including the integrated circuit device of clause a14.
Clause a16. An apparatus for optimizing a neural network model running on a system-on-chip, wherein the neural network model operates using convolution kernels having a size of 1x1 and a step size greater than 1, the apparatus comprising:
a processor; and
a memory storing computer program code for performing the method according to any of clauses A1-a 12;
a compiler that compiles the computer program code under control of the processor to generate a sequence of binary instructions for performing the method, wherein the sequence of binary instructions is adapted to be executed by an artificial intelligence processor.
Clause a17. A computer-readable storage medium comprising program instructions for optimizing a neural network model running on a system-on-chip, which when executed by a processor, implement the method of any one of clauses A1-a 12.
While various embodiments of the present disclosure have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the present disclosure. It should be understood that various alternatives to the embodiments of the disclosure described herein may be employed in practicing the disclosure. It is intended that the following claims define the scope of the disclosure and that equivalents or alternatives within the scope of these claims be covered thereby.

Claims (17)

1. A method for optimizing a neural network model running on a system-on-chip, wherein the neural network model operates using a convolution kernel having a size of 1x1 and a step size greater than 1, the method comprising:
obtaining, from input data, valid data to be subjected to a convolution operation with the convolution kernel;
continuously storing the valid data on the system-on-chip so as to convert the convolution operation into a matrix multiplication operation; and
performing the matrix multiplication operation between the valid data and the convolution kernel on the system-on-chip to obtain output data.
2. The method of claim 1, wherein obtaining valid data of the input data to be convolved with the convolution kernel comprises:
determining valid data and invalid data in the input data according to the position index of the input data, the position index of the output data serving as an operation result, and the step size of the convolution kernel, wherein the invalid data is data in the input data that does not participate in the convolution operation.
3. The method of claim 2, wherein the input data and output data are both four-dimensional tensor data, wherein the position index includes ordinal numbers of the input data and output data in a row and column dimension, and the step size of the convolution kernel includes height and width values of the steps.
4. The method of claim 1, wherein the input data is tensor data comprising at least input channel dimensions, wherein continuously storing valid data in the input data on a system-on-chip comprises:
selecting to continuously store the valid data on the system-on-chip in different ways according to comparison of the size of the input channel dimension with a threshold.
5. The method of claim 4, wherein the size of the input channel dimension is greater than the threshold, and wherein continuously storing valid data of the input data on a system-on-chip comprises:
selectively reading only the valid data from an off-chip system storing the input data; and
continuously storing the read valid data in the system-on-chip for the matrix multiplication operation.
6. The method of claim 4, wherein a size of the input channel dimension is less than or equal to the threshold, and wherein continuously storing the valid data on a system-on-chip comprises:
reading a portion of input data associated with the valid data from an off-chip system to the on-chip system; and
performing, at the system-on-chip, a data movement operation on the portion of the input data to remove the invalid data in the portion of the input data and to cause the valid data to be stored contiguously.
7. The method of claim 4, wherein the threshold is determined according to one or more of:
the size of the step size of the convolution kernel;
a size of the input channel dimension; and
a data bit width of the input data.
8. The method of any one of claims 1-7, wherein the operation of the neural network model comprises a convolution operation in forward propagation and/or backward propagation, wherein the method comprises:
converting, through the continuous storage, the convolution operation in the forward propagation and/or the backward propagation into a matrix multiplication operation between the valid data and the convolution kernel.
9. The method of claim 8, wherein when the operation of the neural network model is a convolution operation in the forward propagation, the input data comprises input neuron data and the output data comprises output neuron data.
10. The method according to claim 8 or 9, wherein when the operation of the neural network model is a convolution operation in the back propagation, the input data includes input gradient data input from a previous layer in the back propagation direction to a current layer, and the output data includes output gradient data output from the current layer to a next layer in the back propagation direction.
11. The method of claim 10, wherein for the output gradient data as a result of a matrix multiplication operation, the method further comprises:
performing a corresponding movement of output gradient data according to row-column dimensions of valid data and invalid data of input neuron data of the current layer in forward propagation so as to move the output gradient data to row-column dimensions corresponding to the valid data and insert invalid data at dimensions corresponding to the invalid data.
12. The method of claim 8, further comprising:
reading convolution kernel data associated with the convolution kernel from an off-chip system and keeping it resident on the on-chip system for use in the matrix multiplication operation with the valid data.
13. A system on a chip comprising a processor and a memory, and the memory storing program instructions which, when executed by the processor, carry out the method according to any one of claims 1-12.
14. An integrated circuit device comprising the system on a chip of claim 13.
15. A board card comprising the integrated circuit device of claim 14.
16. An apparatus for optimizing a neural network model running on a system-on-chip, wherein the neural network model operates using a convolution kernel having a size of 1x1 and a step size greater than 1, the apparatus comprising:
a processor; and
a memory storing computer program code for performing the method of any of claims 1-12;
a compiler that compiles the computer program code under control of the processor to generate a sequence of binary instructions for performing the method, wherein the sequence of binary instructions is adapted to be executed by an artificial intelligence processor.
17. A computer readable storage medium comprising program instructions for optimizing a neural network model running on a system-on-chip, which when executed by a processor, implement the method of any one of claims 1-12.
CN202110781855.2A 2021-07-09 2021-07-09 Method for optimizing neural network model and related product Pending CN115599738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110781855.2A CN115599738A (en) 2021-07-09 2021-07-09 Method for optimizing neural network model and related product

Publications (1)

Publication Number Publication Date
CN115599738A true CN115599738A (en) 2023-01-13

Family

ID=84841239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110781855.2A Pending CN115599738A (en) 2021-07-09 2021-07-09 Method for optimizing neural network model and related product

Country Status (1)

Country Link
CN (1) CN115599738A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination