CN113469333B - Artificial intelligence processor, method and related products for executing neural network model

Artificial intelligence processor, method and related products for executing neural network model

Info

Publication number
CN113469333B
CN113469333B (application CN202110721919.XA; also published as CN113469333A)
Authority
CN
China
Prior art keywords
index
vector
result
input
artificial intelligence
Prior art date
Legal status
Active
Application number
CN202110721919.XA
Other languages
Chinese (zh)
Other versions
CN113469333A (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110721919.XA priority Critical patent/CN113469333B/en
Publication of CN113469333A publication Critical patent/CN113469333A/en
Application granted granted Critical
Publication of CN113469333B publication Critical patent/CN113469333B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Signal Processing (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure discloses an artificial intelligence processor, a processing method and related products for executing neural network models. The artificial intelligence processor may be implemented as a computing device included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete a user-specified computing operation. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The scheme of the disclosure provides a fused processing scheme for an upper pooling layer and a deep convolution layer in a neural network model, which can effectively reduce off-chip memory access bandwidth, relieve memory access pressure, and improve the processing efficiency of the machine.

Description

Artificial intelligence processor, method and related products for executing neural network model
Technical Field
The present disclosure relates generally to the field of data processing. More particularly, the present disclosure relates to artificial intelligence processors, chips, boards, and methods of executing neural network models using artificial intelligence processors.
Background
Deep learning has become an important branch of machine learning and has greatly advanced the development of artificial intelligence (AI). Its core technology, the deep neural network (DNN), has found wide application in many industries.
To increase the expressive power of neural network models, DNNs are continually evolving towards deeper or wider network scales. However, the increase in network depth also brings problems such as a large volume of data I/O and heavy memory access. Therefore, to fully exploit the advantages of neural network models, the memory access bottleneck faced by artificial intelligence processors needs to be addressed.
Disclosure of Invention
To at least partially address one or more of the technical problems noted in the background, aspects of the present disclosure provide an artificial intelligence processor, a chip, a board card, and a method of executing a neural network model using the artificial intelligence processor.
In a first aspect, the present disclosure discloses an artificial intelligence processor that executes a neural network model comprising an upper pooling layer and a deep convolution layer, the artificial intelligence processor comprising a control circuit, an operation circuit and an on-chip storage circuit, wherein: the control circuit is used for controlling the loading of the input data of the upper pooling layer and the convolution kernel of the depth convolution layer from an off-chip storage circuit to the on-chip storage circuit; the operation circuit is used for executing a fusion operation of the upper pooling layer and the depth convolution layer on the input data and the convolution kernel, and writing the fusion operation result back to the on-chip storage circuit; and the control circuit is further used for controlling the output of the fusion operation result from the on-chip storage circuit to the off-chip storage circuit.
In a second aspect, the present disclosure provides a chip comprising an artificial intelligence processor of any of the embodiments of the first aspect described above.
In a third aspect, the present disclosure provides a board comprising the chip of any one of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a method of executing a neural network model using the artificial intelligence processor of any of the embodiments of the first aspect described above.
Through the above artificial intelligence processor for executing a neural network model, the method of executing a neural network model using the artificial intelligence processor, the chip and the board card, the embodiments of the present disclosure provide a fusion optimization scheme for an upper pooling layer and a deep convolution layer in a neural network model, which can effectively reduce off-chip memory access bandwidth, relieve memory access pressure, and improve the processing efficiency of the machine.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts, and in which:
FIG. 1 illustrates a block diagram of a board of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combination processing device according to an embodiment of the present disclosure;
FIG. 3 illustrates an internal architecture schematic diagram of a processor core of a single-core or multi-core computing device of an embodiment of the present disclosure;
FIG. 4 shows an exemplary illustration of a neural network model to which embodiments of the present disclosure may be applied;
FIG. 5 illustrates a schematic diagram of the operation of the upper pooling layer;
FIG. 6 illustrates a schematic diagram of the operation of a deep convolutional layer;
FIG. 7 illustrates an exemplary operation of the upper pooling layer before fusion;
FIG. 8 illustrates an exemplary operation of a pre-fusion deep convolutional layer;
FIG. 9 illustrates a fusion operation process of an upper pooling layer and a deep convolutional layer of an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of an index mapping relationship of an embodiment of the present disclosure;
FIG. 11 illustrates an exemplary block diagram of an artificial intelligence processor in accordance with an embodiment of the present disclosure; and
FIG. 12 illustrates an exemplary flowchart of a method of executing a neural network model by an artificial intelligence processor in accordance with an embodiment of the present disclosure.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that may be obtained by those skilled in the art without inventive effort are within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification, and drawings of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to a determination" or "in response to detection", depending on the context.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present disclosure. As shown in fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. The combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, so as to meet the intelligent processing requirements of complex fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in the field of cloud intelligence; a notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the storage capacity and computing capability of the platform. The board 10 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage, and powerful computing capability.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface means 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface means 102. The external interface device 102 may have different interface forms, such as PCIe interfaces, etc., according to different application scenarios.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a microcontroller unit (MCU).
Fig. 2 is a block diagram showing a combination processing apparatus in the chip 101 of this embodiment. As shown in fig. 2, the combined processing means 20 comprises computing means 201, interface means 202, processing means 203 and storage means 204.
The computing device 201 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computations. It may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface means 202 are used for transmitting data and control instructions between the computing means 201 and the processing means 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, writing to a storage device on the chip of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202, and write the control instructions into a control cache on the chip of the computing device 201. Alternatively or in addition, the interface device 202 may also read data in the memory device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may include one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. When considered on its own, the computing device 201 of the present disclosure may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM, typically DDR memory with a capacity of 16 GB or more, and stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of the internal architecture of a processor core when the computing device 201 is a single-core or multi-core device. The computing device 301 is configured to process input data from fields such as computer vision, speech, natural language, and data mining, and includes three modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operation of the operation module 32 and the storage module 33 to complete deep learning tasks, and comprises an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 is configured to fetch instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoded results to the operation module 32 and the storage module 33 as control information.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit 322 is responsible for the core computation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used for storing or transferring related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a parameter storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. NRAM 331 is used to store input neurons, output neurons, and intermediate computation results; WRAM 332 is configured to store the convolution kernels, i.e., the weights, of the deep learning network; DMA 333 is coupled to DRAM 204 via bus 34 and is responsible for data transfer between the computing device 301 and DRAM 204.
Embodiments of the present disclosure provide an artificial intelligence processor that executes a neural network model based on the aforementioned hardware environment, supporting a fusion optimization process for an upper pooling layer and a deep convolutional layer in the neural network model.
FIG. 4 shows an exemplary illustration of a neural network model to which embodiments of the present disclosure may be applied.
As shown, the neural network model 400 includes two parts: a convolutional network portion 410 and a deconvolution network portion 420. The convolutional network portion acts as a feature extractor, converting the input picture into a multi-dimensional feature representation. The deconvolution network portion acts as a shape generator, producing the target segmentation result from the features extracted by the convolutional network.
The convolutional network portion 410 includes a plurality of layers, which may include interleaved convolutional layers 411, pooling layers 412, and so on. The convolutional layer performs feature extraction by applying several filters to the input data. The pooling layer is mainly used for reducing the scale of the input data and reducing overfitting. There are many ways to perform pooling, the most common being max pooling, average pooling and stochastic pooling. A linear layer may exist between the convolutional layer and the pooling layer.
Deconvolution network portion 420 is a mirror image of the convolutional network portion and may include an upper pooling layer 421, deconvolution layer 422, etc. that are spaced apart. In neural networks, since the size of the output tends to be small after the input image has been characterized by the convolutional layer, it is sometimes necessary to restore the image to its original size for further calculations, such as semantic segmentation of the image, which can be achieved by the upper pooling layer 421. The upper pooling layer 421 may be paired with the pooling layer 412 in the convolutional network part. The deconvolution layer is the inverse of the convolution layer that is used to recover the size of the picture before convolution. The operation of the deconvolution layer may be a deep convolution operation and is therefore sometimes referred to as a deep convolution layer. A linear layer may be present between the upper pooling layer and the deconvolution layer.
It will be appreciated that the above description of the neural network model is merely exemplary, and that the structure of the neural network model is not limited to the structure shown in the figures, and that modifications may be made to the structure shown in the figures as desired by those skilled in the art. The operation of the relevant layers is described in detail below.
FIG. 5 illustrates a schematic diagram of the operation of paired pooling and upper pooling layers. In this example, a max-pooling approach is employed. Note that pooling is typically not performed in the channel direction (also referred to as the depth direction), so the figure shows only pooling and upper pooling on a single channel.
As shown, the left side is a max pooling operation diagram, where the size of the input image 510 is 4×4, the pooling window size is 2×2, and the pooling stride is (2, 2). That is, with a stride of 2 in both the X and Y directions, the maximum value is selected from each 2×2 pooling window as the value of that window, as indicated by the dark squares in the figure. The maxima selected from all pooling windows constitute the output image 520, which is 2×2 in size. At the same time, the position of each maximum in the input image, i.e. the pooling index 530, needs to be recorded.
The right side of the figure shows the paired upper pooling operation. The input information of the upper pooling layer comprises a 2×2 input image 540 and the pooling index 530. The upper pooling layer uses the pooling index 530 to place each element of the input image 540 at the position indicated by its index, restoring the 4×4 size before pooling, while the remaining positions are filled with zeros, thereby obtaining the up-pooled output image 550.
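For illustration only, the following is a minimal NumPy sketch of this pairing of max pooling (which records the indices of the maxima) and upper pooling (which scatters values back by those indices). The function names and the flat-index convention are illustrative assumptions, not part of the disclosed hardware implementation:

```python
import numpy as np

def max_pool_with_index(x, win=2, stride=2):
    """Max-pool a single-channel map, recording the flat position of each maximum."""
    h, w = x.shape
    oh, ow = h // stride, w // stride
    out = np.zeros((oh, ow), dtype=x.dtype)
    idx = np.zeros((oh, ow), dtype=np.int64)
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + win, j * stride:j * stride + win]
            k = int(window.argmax())                   # position inside the window
            idx[i, j] = (i * stride + k // win) * w + (j * stride + k % win)
            out[i, j] = window.flat[k]
    return out, idx

def max_unpool(y, idx, out_shape):
    """Upper pooling: place each element at its recorded position, zero elsewhere."""
    out = np.zeros(out_shape, dtype=y.dtype)
    out.flat[idx.ravel()] = y.ravel()
    return out

x = np.array([[1, 2, 5, 3],
              [4, 8, 7, 6],
              [3, 1, 2, 0],
              [9, 4, 6, 5]], dtype=np.float32)
pooled, index = max_pool_with_index(x)                 # 2x2 maxima and their indices
restored = max_unpool(pooled, index, x.shape)          # 4x4 map, maxima restored, rest zero
```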
Fig. 6 shows a schematic diagram of the operation of the depth convolution layer. Depth convolution differs from standard convolution in that there is no accumulation along the depth direction, i.e. across input channels. In a standard convolution, each convolution kernel is computed against, and accumulated over, all layers (input channels) of the input feature map, so the number of input channels of each convolution kernel equals the number of input channels of the input feature map. In a depth convolution, by contrast, each convolution kernel has a single channel: one kernel is responsible for one channel, and each channel is convolved by only one kernel.
As shown, the dimension of the input feature map 610 is 12×12×3, i.e., it includes three channels, each comprising a 12×12 image. In this depth convolution, 3 convolution kernels 620 are used, each of which is single-channel, e.g., 5×5×1 in size. Each convolution kernel convolves only one channel of the input feature map 610; each such convolution yields an output of size 8×8×1, and these outputs are then stacked together to create an 8×8×3 image, finally yielding an output feature map 630 of size 8×8×3. As can be seen from the figure, the depth (number of channels) of the output feature map remains consistent with that of the input feature map.
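As an illustration of this per-channel behaviour, the following is a minimal NumPy sketch of a depth (depthwise) convolution matching the 12×12×3 example; the function name and the naive loop structure are illustrative assumptions only:

```python
import numpy as np

def depthwise_conv2d(x, kernels, stride=1):
    """Each channel of x is convolved only with its own single-channel kernel."""
    H, W, C = x.shape
    kh, kw, kc = kernels.shape
    assert kc == C, "one kernel per input channel"
    oh = (H - kh) // stride + 1
    ow = (W - kw) // stride + 1
    out = np.zeros((oh, ow, C), dtype=x.dtype)
    for c in range(C):                                 # channels are never accumulated together
        for i in range(oh):
            for j in range(ow):
                patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw, c]
                out[i, j, c] = np.sum(patch * kernels[:, :, c])
    return out

x = np.random.rand(12, 12, 3).astype(np.float32)       # input feature map, 3 channels
k = np.random.rand(5, 5, 3).astype(np.float32)         # one 5x5 kernel per channel
y = depthwise_conv2d(x, k)                             # shape (8, 8, 3): depth is preserved
```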
The principles of operation of the upper pooling layer and the depth convolution layer are described above. In a conventional neural network model, the operations of the network layers are relatively independent and executed sequentially: after the previous layer's operation is completed, its result is stored back to the off-chip storage circuit, and when the next layer's operation is executed, the data is read from the off-chip storage circuit again before the corresponding operation is performed.
FIG. 7 illustrates an exemplary operation of the upper pooling layer before fusion. In the example of fig. 7, the parameters of the upper pooling are assumed to be as follows: the pooling window size is 2×2, the pooling stride equals the pooling window size, also 2×2, and the padding of the pooling result is (1, 0, 1, 0), i.e. one row is padded on top, one column is padded on the left, and no padding is applied below or on the right. In addition, the pooling index is the same for every pooling window, all pointing to the upper-left corner.
As shown, the input feature map 710 is 2×2 in size, the pooling window is also 2×2, the corresponding pooling index 720 is 4×4, and the index in each window is fixed to 0 (i.e., the upper-left corner of the window). After performing the upper pooling operation on the input feature map 710 according to the pooling index 720, an output feature map 730 of size 5×5 is obtained, where the gray portion is the padding; each element of the input feature map is placed at the position indicated by its pooling index in the output feature map, and the remaining positions are filled with zeros.
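For illustration only, this scatter can be written as a short NumPy sketch; the values are illustrative, and rows/columns 1 and 3 are the positions selected by the pooling index after one row of top padding and one column of left padding:

```python
import numpy as np

inp = np.array([[1, 2], [3, 4]], dtype=np.float32)     # 2x2 input feature map
out = np.zeros((5, 5), dtype=np.float32)               # padded upper-pooling output
out[1::2, 1::2] = inp                                  # non-zero positions (1,1),(1,3),(3,1),(3,3)
```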
When the upper pooling layer is executed without fusion, the input of the upper pooling layer (e.g., input feature map 710 and pooling index 720) is first read from the off-chip storage circuit (e.g., the DRAM storage device 204 of fig. 2) into the on-chip storage circuit; for example, NRAM 331 of fig. 3 stores the input feature map 710 and WRAM 332 of fig. 3 stores the pooling index 720.
Next, an arithmetic circuit (e.g., vector arithmetic unit 321 or matrix arithmetic unit 322 in fig. 3) reads the data from NRAM 331 and WRAM 332, completes the operation, and writes the upper pooling result (e.g., output feature map 730) back to NRAM 331.
Finally, the processor writes the result of the upper pooling back from NRAM 331 into the off-chip storage circuit (e.g., the DRAM storage device 204 of fig. 2) as input to the next neural network layer.
Fig. 8 shows an exemplary operation of the depth convolution layer before fusion. In the example of fig. 8, the depth convolution layer is the layer following the upper pooling layer of fig. 7 and performs a depth convolution on the up-pooled feature map. The assumed depth convolution parameters are as follows: the convolution kernel size is 2×2, the convolution stride is 1×1, and the padding is (0, 0).
As shown, the input feature map 810, i.e., the output feature map 730 of fig. 7, has a size of 5×5 and the convolution kernel 820 is 2×2. After performing a depth convolution on the input feature map 810 using the convolution kernel 820, an output feature map 830 is obtained, whose dimension size is 4×4.
When the depth convolution layer is executed without fusion, the output of the previous layer (e.g., the upper pooling layer of fig. 7) is first read as input (e.g., input feature map 810) from the off-chip storage circuit (e.g., the DRAM storage device 204 of fig. 2) into the on-chip storage circuit, for example NRAM 331 in fig. 3, which stores the input feature map 810. In addition, the convolution kernel 820 of the depth convolution layer is also read from the off-chip storage circuit into WRAM 332 in fig. 3.
Next, an arithmetic circuit (e.g., vector arithmetic unit 321 or matrix arithmetic unit 322 in fig. 3) reads the data from NRAM 331 and WRAM 332, completes the operation, and writes the result of the depth convolution (e.g., output feature map 830) back to NRAM 331.
Finally, the processor writes the result of the depth convolution back from NRAM 331 to the off-chip storage circuit (e.g., the DRAM storage device 204 of fig. 2) as input to the next neural network layer.
As can be seen from the operational procedure described in connection with fig. 7 and 8, the sequential execution of the various network layers unnecessarily increases the data handling of the same piece of data (in this example, the output profile 730 of the upper pooling layer, which is the input profile 810 of the deep convolutional layer) from the on-chip memory circuit to the off-chip memory circuit, and from the off-chip memory circuit to the on-chip memory circuit again, thereby increasing the pressure of data access.
In view of this, the disclosed embodiments propose a fusion scheme of an upper pooling layer and a deep convolution layer that fuses the operations of two network layers together, thereby avoiding the back and forth handling of intermediate data between off-chip and on-chip memory circuits and reducing the data access pressure. Further, in some embodiments, the operation performance may be further improved by analyzing the operation characteristics of the two network layers, adjusting the operation process, and effectively fusing them.
As can be seen from the operation procedures of the upper pooling layer and the depth convolution layer described above, positions other than those indicated by the pooling index are filled with zeros during upper pooling, so the final result of the depth convolution depends only on the non-zero input data elements, i.e. the input feature map of the upper pooling layer. The multiply-add operation with the convolution kernel can therefore be performed directly on the input feature map of the upper pooling layer. In addition, when there is at most one non-zero data element in the receptive field corresponding to each convolution output point, no accumulation is needed between the calculation results. Further, as can be seen from the output feature map 830 of fig. 8, the positions of the product results in the output feature map are determined by the pooling index, so the final result can be obtained simply by rearranging the products according to the pooling index.
Accordingly, in a fusion operation scheme of embodiments of the present disclosure, a fusion operation of upper pooling and depth convolution may include two steps: calculating the operation result (such as a product result) of the convolution kernel and the upper pooled input data; and rearranging the operation results according to the pooled index to obtain a final fused operation result.
FIG. 9 illustrates a fusion operation of an upper pooling layer and a deep convolutional layer of an embodiment of the present disclosure.
As shown in fig. 9, the on-chip memory circuit may first be loaded with the input data of the upper pooling layer and the convolution kernel of the deep convolution layer from the off-chip memory circuit. The input feature map 910 of the upper pooling layer has a size of 2×2×c, where C represents the channel dimension, also referred to as the depth direction. The size of the convolution kernel 920 of the deep convolution layer is also 2×2×c.
In operation, the computation of the convolution kernel with the up-pooled input data may be divided into multiple rounds, in each of which a weight vector along the depth direction of the convolution kernel 920 is multiplied element-wise with the input vectors along the depth direction of the input feature map 910, yielding a plurality of result vectors along the depth direction.
For example, the weight vector at a fixed height H and width W position may be taken along the channel/depth direction, e.g., weight vector a at the position where H and W are both 1, represented by the dark portion in 920. Weight vector a is multiplied element-wise with each input vector along the depth direction of the input feature map 910, yielding the 4 result vectors shown at 930. Similarly, in the next round of operation, weight vector b is multiplied with each input vector, yielding the 4 result vectors at 940. The result vectors 950 and 960 corresponding to weight vectors c and d are obtained in the same way.
The result vectors are then reordered according to the indication of the pooling index of the upper pooling layer to obtain a final fused operation result 970.
As shown, the operation for the 1st weight vector a in fig. 9 is a · [1, 2, 3, 4] = [1·a, 2·a, 3·a, 4·a], which can be understood as the products obtained when input vectors 1, 2, 3 and 4 each sit at the 1st position of the convolution kernel. Similarly, the operation for the 2nd weight vector b is b · [1, 2, 3, 4] = [1·b, 2·b, 3·b, 4·b], i.e. the products obtained when 1, 2, 3 and 4 each sit at the 2nd position of the convolution kernel. Thus, the final position of each result vector may be determined in conjunction with the pooling index.
Specifically, in some embodiments, the index of each input vector is determined based on the pooling index; the index of the corresponding result vector is then determined from the index of the input vector according to an index mapping relationship; finally, the result vectors are rearranged in the order of their indices to obtain the fusion operation result.
The pooled index may have a variety of representations, which may be one-dimensional or two-dimensional, as the disclosure is not limited in this respect. The indices of different dimensions may be converted from each other, for example, according to a predetermined traversal rule, converting a two-dimensional index into a one-dimensional index, and vice versa. Taking the pooled index 720 in fig. 7 as an example, for a total of 16 positions, the position indexes with non-zero data (dark squares) are (1, 1), (1, 3), (3, 1) and (3, 3), respectively, which may also be denoted as one-dimensional indexes, 0, 2, 8 and 10, respectively. In some embodiments, the pooling index of each pooling window in the upper pooling layer is consistent, e.g., in the upper left corner in the examples of fig. 7 and 9. In these embodiments, the pooled index may only need one index representation in the pooled window, e.g., 0 represents the upper left corner position in the 2×2 pooled window.
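For illustration, a one-dimensional (flat) index and a two-dimensional index can be converted into each other as sketched below; 0-based coordinates are assumed here (the positions quoted above count rows and columns from 1, while the flat indexes count from 0), and the helper names are illustrative:

```python
def to_flat(row, col, width):
    return row * width + col        # (0,0)->0, (0,2)->2, (2,0)->8, (2,2)->10 when width=4

def to_2d(flat, width):
    return divmod(flat, width)      # inverse mapping
```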
From the pooled indices, the pooled index of each input vector may be determined correspondingly, e.g., the pooled index of input vector "1" in input feature map 910 is 0, the pooled index of input vector "2" is 2, the pooled index of input vector "3" is 8, and the pooled index of input vector "4" is 10.
The index of the corresponding result vector may then be determined from the index of the input vector according to the index mapping relationship. Each result vector is obtained by multiplying a weight vector in the convolution kernel with an input vector, so the index mapping relationship relates the position of the weight vector and the index of the input vector to the index of the corresponding result vector in the final depth convolution result. In other words, the index of a product result can be determined from the position of the weight vector in the convolution kernel and the index of the input vector.
In some embodiments, padding operations exist for the upper pooling operation and/or the depth convolution operation. For example, in target detection algorithms based on point cloud data, "same" padding is required, that is, the output data has the same shape as the input data after the convolution operation. It will be appreciated that different padding rules may exist in other convolution scenarios. In some embodiments of the present disclosure, the padding of the upper pooling layer is (1, 0, 1, 0), i.e., 1 is added on the top and left, no padding is added on the bottom and right, and the padding of the depth convolution layer is 0.
When padding is present, it affects the index of the input vector. Referring again to fig. 8, the gray portion of the input feature map 810 represents the padding area. As can be seen from the figure, the coordinates of any data point (x, y) in the original input data become (x + pad_left, y + pad_top) in the padded input data, where pad_left is the amount of padding on the left and pad_top is the amount of padding on the top. It is understood that the index of the input data can be adjusted by a simple addition according to the padding rule.
Thus, in these embodiments, the index of the input vector may be adjusted based on the padding rules of the upper pooling layer and the depth convolution layer before the index of the result vector is determined according to the index mapping relationship. Those skilled in the art will appreciate that the index adjustment may also occur after or during the index mapping, provided that the effects of the padding rules are taken into account, and the embodiments of the present disclosure are not limited in this respect.
Fig. 10 exemplarily shows the index mapping relationship between the weight vector, the input vector and the result vector. The padding rules have been taken into account in this example, as indicated by the grey portions. 1010, 1020, 1030 and 1040 in the figure each represent the product operation of the 1st weight vector a with the corresponding input vector. The arrows in the figure indicate the positions of the respective product results in the convolution operation result 1050.
Specifically, the result of the multiplication of the input vector "1" (the pooled index is (2, 2)) and the weight vector a indicated at 1010 corresponds to the (2, 2) position in the convolution result of 4×4, the result of the multiplication of the input vector "2" (the pooled index is (2, 4)) and the weight vector a indicated at 1020 corresponds to the (2, 4) position in the convolution result, the result of the multiplication of the input vector "3" (the pooled index is (4, 2)) and the weight vector a indicated at 1030 corresponds to the (4, 2) position in the convolution result, and the result of the multiplication of the input vector "4" (the pooled index is (4, 4)) and the weight vector a indicated at 1040 corresponds to the (4, 4) position in the convolution result.
As can be seen from the figure, when the input vector and the 1 st weight vector of the convolution kernel are multiplied, the index has the following mapping relation: assuming that the index of the input vector is (x, y), the index of the result vector of the multiplication operation with the 1 st weight vector is also (x, y).
Similarly, the relationship of the index of the result vector to the index of the input vector when the input vector is multiplied by other weight vectors can be deduced. For example, when the input vector is multiplied by the 2 nd weight vector of the convolution kernel, the index has the following mapping relationship: assuming that the index of the input vector is (x, y), the index of the result vector of the multiplication operation with the 2 nd weight vector is (x, y-1). When the input vector and the 3 rd weight vector of the convolution kernel are multiplied, the index has the following mapping relation: assuming that the index of the input vector is (x, y), the index of the result vector of the multiplication operation with the 3 rd weight vector is (x-1, y).
In summary, each input vector traverses the 2×2 convolution kernel in turn, so the offset of the input vector relative to the center point of the convolution kernel is fixed. Based on this characteristic, the index of each vector product associated with an input vector can be determined directly from the index of that input vector. That is, knowing only the index of an input vector, the indices of its products with all of the weight vectors can be determined.
Therefore, based on the fixed coordinate offsets, the index of the corresponding convolution-kernel center point can be derived in turn from the index of the input vector as it traverses the convolution kernel; the center-point index is then mapped to the output-point index to determine the index of each vector product generated by that input vector.
Furthermore, in some embodiments, some vector product results may fall outside the convolution result range and are therefore invalid. For example, when the pooling index is 3, i.e., the input vector lies in the lower right corner of the pooling window, some result vectors will fall outside the convolution result range. In these cases, an index that exceeds the range of the convolution result (i.e., the range of the output feature map dimensions) may be set to a predetermined value, such as -1, so that these invalid results are identified during the rearrangement and excluded from it.
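For illustration only, the mapping together with this invalid-index convention can be summarized in a small helper, assuming the running example (2×2 convolution kernel, stride 1, weight vectors numbered 1 to 4 in row-major order, 0-based coordinates, and an input-vector index that has already been adjusted for padding); the name result_index and its parameters are illustrative:

```python
def result_index(x, y, weight_no, out_h, out_w):
    """Index of the result vector produced by the (padding-adjusted) input vector at
    (x, y) and the weight vector numbered weight_no; (-1, -1) marks an invalid result."""
    p, q = divmod(weight_no - 1, 2)     # row/column offset of the weight vector in the kernel
    r, s = x - p, y - q                 # weight 1 -> (x, y), weight 2 -> (x, y-1), weight 3 -> (x-1, y)
    if 0 <= r < out_h and 0 <= s < out_w:
        return r, s
    return -1, -1                       # outside the output range: skipped during rearrangement
```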
After the indexes of the result vectors are determined, the result vectors may be rearranged in the order of the indexes of the result vectors to obtain a fusion operation result.
For example, according to the index mapping relationship of fig. 10, the indexes of the four result vectors in 930 of fig. 9 are (2, 2), (2, 4), (4, 2), and (4, 4), respectively, which are arranged at corresponding positions in the final fusion operation result 970, respectively. Other result vectors are similarly reordered.
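For illustration only, the following is a minimal NumPy sketch of the complete fusion operation under the assumptions of this example (2×2 up-pooling input, pooling index fixed at the upper-left corner of each 2×2 window, up-pooling padding of one row on top and one column on the left, 2×2 depthwise convolution kernel, stride 1, no convolution padding, 0-based coordinates). All variable names are illustrative, and the explicit unpool-then-convolve reference path is included only to verify that the two-step fused computation produces the same result:

```python
import numpy as np

C = 8                                                  # arbitrary channel count
inp = np.random.rand(2, 2, C).astype(np.float32)       # upper-pooling input vectors
ker = np.random.rand(2, 2, C).astype(np.float32)       # depthwise convolution kernel
OH, OW = 4, 4                                          # size of the fused (convolution) output

# Step 1: element-wise products of every weight vector with every input vector.
# products[p, q, i, j, :] = ker[p, q, :] * inp[i, j, :]
products = ker[:, :, None, None, :] * inp[None, None, :, :, :]

# Step 2: rearrange the result vectors according to the pooling index mapping.
fused = np.zeros((OH, OW, C), dtype=np.float32)
for i in range(2):
    for j in range(2):
        x, y = 2 * i + 1, 2 * j + 1                    # padded position of input vector (i, j)
        for p in range(2):
            for q in range(2):
                r, s = x - p, y - q                    # mapped index of the result vector
                if 0 <= r < OH and 0 <= s < OW:        # drop invalid (out-of-range) results
                    fused[r, s, :] += products[p, q, i, j, :]

# Reference path: explicit upper pooling (with padding) followed by depth convolution.
up = np.zeros((5, 5, C), dtype=np.float32)
up[1::2, 1::2, :] = inp                                # scatter by pooling index, zero elsewhere
ref = np.zeros((OH, OW, C), dtype=np.float32)
for r in range(OH):
    for s in range(OW):
        ref[r, s, :] = (up[r:r + 2, s:s + 2, :] * ker).sum(axis=(0, 1))

assert np.allclose(fused, ref)                         # fused result matches unpool + conv
```

Because at most one non-zero input vector falls in each receptive field here, the accumulation in Step 2 never actually adds two products; the += form is kept only so that the same sketch also covers the case of inconsistent pooling indexes discussed below.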
The arithmetic circuitry may then write the fused arithmetic result 970 back to an on-chip memory circuit, such as NRAM 331 in fig. 3. Finally, the fusion operation result can be further output from the on-chip memory circuit to the off-chip memory circuit.
The fusion operation of upper pooling and depth convolution of embodiments of the present disclosure is described in detail above in connection with the accompanying drawings. It will be appreciated that in the above example, since the pooling indexes within the individual pooling windows are identical and the convolution kernel has the same size as the pooling window, at most one input vector falls within the receptive field of each convolution output point, so no accumulation is needed between the result vectors. When the pooling indexes within the pooling windows are not identical, multiple input vectors may fall within the receptive field of a convolution output point, and result vectors that map to the same index must be accumulated element-wise. The other operations are the same as those described above and are not repeated here.
Embodiments of the present disclosure also provide an artificial intelligence processor for executing a neural network model, and a method of executing a neural network model implemented by the artificial intelligence processor. The neural network model includes at least an upper pooling layer and a deep convolution layer.
FIG. 11 illustrates a schematic block diagram of an artificial intelligence processor in which embodiments of the present disclosure may be implemented. As shown in FIG. 11, the artificial intelligence processor 1100 includes a control circuit 1110, an arithmetic circuit 1120, and an on-chip memory circuit 1130. The roles and functions of the control circuit, the arithmetic circuit, and the on-chip memory circuit are similar to those of the control module, the arithmetic module, and the memory module described in fig. 3, and will not be described in detail herein.
In some embodiments, the control circuitry 1110 may be configured to control loading of input data of an upper pooling layer of the neural network model and a convolution kernel of a deep convolution layer from the off-chip storage circuitry to the on-chip storage circuitry 1130. The operation circuit 1120 may be configured to perform a fusion operation of the upper pooling layer and the deep convolution layer of the embodiments of the present disclosure with respect to the input data and the convolution kernel, and write the fusion operation result back to the on-chip memory circuit 1130. The control circuit 1110 may be further configured to control outputting the fusion operation result from the on-chip memory circuit 1130 to the off-chip memory circuit.
In some embodiments, the operational circuitry 1120 may include multiplication circuitry 1122 and rearrangement circuitry 1124.
The multiplication circuit 1122 may be configured to perform an element-wise multiplication of each weight vector along the depth direction of the convolution kernel with the input vectors along the depth direction of the input data, to obtain a plurality of result vectors along the depth direction.
In some embodiments, multiplication circuit 1122 may include a plurality of vector multipliers 1123. In operation, the operation circuit 1120 may distribute the individual input vectors along the depth direction of the input data to the plurality of vector multipliers 1123, for example one input vector to each vector multiplier 1123. The operation circuit 1120 may also broadcast the weight vectors along the depth direction of the convolution kernel to the plurality of vector multipliers 1123, for example broadcasting weight vector a, weight vector b, and so on, to all vector multipliers in turn. Each vector multiplier 1123 may then be configured to perform an element-wise multiplication of the broadcast weight vector with its distributed input vector, yielding a result vector.
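For illustration only, a toy software analogue of this distribute-and-broadcast scheme is sketched below; each list entry stands in for one vector multiplier, and the sizes and names are illustrative assumptions:

```python
import numpy as np

depth = 16
inputs = [np.random.rand(depth) for _ in range(4)]      # one input vector per vector multiplier
weights = [np.random.rand(depth) for _ in range(4)]     # weight vectors a, b, c, d

# Each round broadcasts one weight vector to every multiplier, which multiplies it
# element-wise with its own distributed input vector.
rounds = [[w * v for v in inputs] for w in weights]     # 4 rounds x 4 result vectors
```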
The rearrangement circuit 1124 may be configured to rearrange the multiple result vectors obtained by the multiplication circuit 1122 according to the pooling index of the upper pooling layer to obtain the fusion operation result.
In some embodiments, the rearrangement circuit 1124 may be further configured to: determining an index of each input vector based on the pooled index; according to the index mapping relation, determining the index of the corresponding result vector according to the index of the input vector; and rearranging the result vectors according to the index sequence of the result vectors to obtain a fusion operation result.
In some embodiments, the reordering circuitry may adjust the index of the input vector based on padding rules of the upper pooling layer and the depth convolution layer before determining the index of the result vector according to the index mapping relationship.
FIG. 12 illustrates an exemplary flowchart of a method implemented by an artificial intelligence processor to execute a neural network model, according to an embodiment of the disclosure.
Specifically, in step 1210, the control circuitry controls loading the on-chip memory circuitry with the input data of the upper pooling layer and the convolution kernel of the deep convolution layer from the off-chip memory circuitry.
Next, in step 1220, the arithmetic circuitry performs a fusion operation of the upper pooling layer and the deep convolution layer with respect to the input data and the convolution kernel, and writes the fusion operation result back to the on-chip memory circuitry. The specific operation process of the operation circuit has been described in detail above with reference to the accompanying drawings, and will not be described in detail here.
Finally, in step 1230, the control circuitry further controls outputting the fusion operation result from the on-chip memory circuitry to the off-chip memory circuitry.
Those skilled in the art will appreciate that the description of the fusion operation process with the upper pooling layer and the deep convolution layer of the embodiments of the present disclosure described above in connection with the drawings may be equally applied to the artificial intelligence processor of fig. 11 and the method of fig. 12, and thus a repetitive description will not be made.
The present disclosure also provides a chip that may include an artificial intelligence processor of any of the embodiments described above in connection with the accompanying drawings. Further, the present disclosure also provides a board that may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an internet of things terminal, a mobile terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus. The electronic device or apparatus of the present disclosure may also be applied to the internet, the internet of things, data centers, energy sources, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical, and the like. Further, the electronic device or apparatus of the present disclosure may also be used in cloud, edge, terminal, etc. application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, a computationally intensive electronic device or apparatus according to aspects of the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power consuming electronic device or apparatus may be applied to a terminal device and/or an edge device (e.g., a smart phone or camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and collaborative work of an end cloud entity or an edge cloud entity.
It should be noted that, for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of actions and combinations thereof, but those skilled in the art will understand that the aspects of the present disclosure are not limited by the order of actions described. Thus, one of ordinary skill in the art will appreciate in light of the present disclosure or teachings that certain steps thereof may be performed in other sequences or concurrently. Further, those skilled in the art will appreciate that the embodiments described in this disclosure may be considered alternative embodiments, i.e., wherein the acts or modules involved are not necessarily required for the implementation of some or some aspects of this disclosure. In addition, the description of some embodiments of the present disclosure is also focused on, depending on the scenario. In view of this, those skilled in the art will appreciate that portions of one embodiment of the disclosure that are not described in detail may be referred to in connection with other embodiments.
In particular implementations, based on the disclosure and teachings of the present disclosure, one of ordinary skill in the art will appreciate that several embodiments of the disclosure disclosed herein may also be implemented in other ways not disclosed herein. For example, in terms of the foregoing embodiments of the electronic device or apparatus, the units are split in consideration of the logic function, and there may be another splitting manner when actually implemented. For another example, multiple units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of the connection relationship between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings between the units or components. In some scenarios, the foregoing direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustical, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, some or all of the units may be selected to achieve the objectives of the embodiments of the disclosure, as desired. In addition, in some scenarios, multiple units in embodiments of the disclosure may be integrated into one unit, or each unit may physically exist alone.
In other implementation scenarios, the integrated units may also be implemented in hardware, i.e. as specific hardware circuits, which may include digital circuits and/or analog circuits, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, devices such as transistors or memristors. In view of this, various types of devices described herein (e.g., computing devices or other processing devices) may be implemented by appropriate hardware processors, such as central processing units, GPU, FPGA, DSP, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), which may be, for example, variable resistance memory (Resistive Random Access Memory, RRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), static random access memory (Static Random Access Memory, SRAM), enhanced dynamic random access memory (Enhanced Dynamic Random Access Memory, EDRAM), high bandwidth memory (High Bandwidth Memory, HBM), hybrid memory cube (Hybrid Memory Cube, HMC), ROM, RAM, etc.
The foregoing has described the embodiments of the present disclosure in detail, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure; the above examples are provided solely to assist in understanding the methods of the present disclosure and their core ideas. Meanwhile, for those of ordinary skill in the art, there may be variations in the specific implementation and the scope of application in accordance with the ideas of the present disclosure. In view of the foregoing, the contents of this specification should not be construed as limiting the present disclosure.

Claims (9)

1. An artificial intelligence processor for executing a neural network model comprising an upper pooling layer and a depth convolution layer, the processor comprising a control circuit, an operation circuit, and an on-chip storage circuit, wherein:
the control circuit is configured to control loading of input data of the upper pooling layer and a convolution kernel of the depth convolution layer from an off-chip storage circuit to the on-chip storage circuit;
the operation circuit is configured to perform a fusion operation of the upper pooling layer and the depth convolution layer on the input data and the convolution kernel, and to write a fusion operation result back to the on-chip storage circuit; and
the control circuit is further configured to control the on-chip storage circuit to output the fusion operation result to the off-chip storage circuit;
wherein the operation circuit comprises:
a multiplication circuit configured to perform a para-multiplication (element-wise multiplication) of each depth-direction weight vector of the convolution kernel with the depth-direction input vectors of the input data, respectively, to obtain a plurality of depth-direction result vectors; and
a rearrangement circuit configured to rearrange the plurality of result vectors according to a pooling index of the upper pooling layer to obtain the fusion operation result.
2. The artificial intelligence processor of claim 1, wherein the rearrangement circuit is further configured to:
determine an index of each of the input vectors based on the pooling index;
determine, according to an index mapping relationship, an index of the corresponding result vector from the index of the input vector; and
rearrange the result vectors in the order of their indices to obtain the fusion operation result.
3. The artificial intelligence processor of claim 2, wherein the index mapping relationship further indicates that an index exceeding the range of the output data dimensions is set to a predetermined value.
4. The artificial intelligence processor of claim 2, wherein the rearrangement circuit is further configured to:
adjust the index of the input vector based on the padding rules of the upper pooling layer and the depth convolution layer before determining the index of the result vector according to the index mapping relationship.
5. The artificial intelligence processor of any one of claims 1-4, wherein the pooling index of each pooling window in the upper pooling layer is uniform.
6. The artificial intelligence processor of any one of claims 1-4, wherein the multiplication circuit comprises a plurality of vector multipliers, and
the operation circuit is further configured to: distribute the depth-direction input vectors of the input data among the plurality of vector multipliers, and broadcast the depth-direction weight vectors of the convolution kernel to the plurality of vector multipliers; and
each of the vector multipliers is configured to perform a para-multiplication (element-wise multiplication) of the broadcast weight vector and the distributed input vector to obtain a result vector.
7. A chip comprising an artificial intelligence processor according to any one of claims 1-6.
8. A board card comprising the chip of claim 7.
9. A method of executing a neural network model using the artificial intelligence processor of any one of claims 1-6.
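To make the fusion operation recited in claims 1-3 more concrete, the following is a minimal NumPy sketch of one plausible reading of it: a 2x2 max-unpooling step (the "upper pooling layer") fused with a stride-1 depthwise convolution (the "depth convolution layer"). Each depth-direction weight vector is multiplied element-wise with each depth-direction input vector, and the result vectors are then scattered to output positions derived from the pooling index; indices falling outside the output are simply dropped, loosely corresponding to the "predetermined value" of claim 3. The function names, the 2x2 window, the stride of 1, and the padding of 1 are illustrative assumptions rather than details taken from the patent; the sketch only verifies that the fused path matches the naive unpool-then-convolve path.

```python
import numpy as np

def unpool_then_depthwise(x, idx, w, pad=1):
    """Reference path: materialize the 2x2 max-unpooling output, then run a
    stride-1 depthwise convolution over the zero-padded upsampled map."""
    C, H, W = x.shape
    _, kH, kW = w.shape
    up = np.zeros((C, 2 * H, 2 * W), dtype=x.dtype)
    for h in range(H):
        for v in range(W):
            # the pooling index selects the position inside the 2x2 unpooling window
            uh, uw = 2 * h + idx[h, v] // 2, 2 * v + idx[h, v] % 2
            up[:, uh, uw] = x[:, h, v]
    upp = np.pad(up, ((0, 0), (pad, pad), (pad, pad)))
    Ho, Wo = 2 * H + 2 * pad - kH + 1, 2 * W + 2 * pad - kW + 1
    out = np.zeros((C, Ho, Wo), dtype=x.dtype)
    for i in range(Ho):
        for j in range(Wo):
            out[:, i, j] = np.sum(upp[:, i:i + kH, j:j + kW] * w, axis=(1, 2))
    return out

def fused_depthwise_unpool(x, idx, w, pad=1):
    """Fused path: multiply each depth-direction weight vector element-wise with
    each depth-direction input vector, then scatter the result vectors to the
    output positions implied by the pooling index."""
    C, H, W = x.shape
    _, kH, kW = w.shape
    Ho, Wo = 2 * H + 2 * pad - kH + 1, 2 * W + 2 * pad - kW + 1
    out = np.zeros((C, Ho, Wo), dtype=x.dtype)
    for h in range(H):
        for v in range(W):
            uh, uw = 2 * h + idx[h, v] // 2, 2 * v + idx[h, v] % 2
            for ki in range(kH):
                for kj in range(kW):
                    oi, oj = uh + pad - ki, uw + pad - kj
                    if 0 <= oi < Ho and 0 <= oj < Wo:  # out-of-range indices are dropped
                        # result vector = depth-direction weight vector * depth-direction input vector
                        out[:, oi, oj] += w[:, ki, kj] * x[:, h, v]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4)).astype(np.float32)   # input data: 8 channels, 4x4
idx = rng.integers(0, 4, size=(4, 4))                    # one pooling index per window (cf. claim 5)
w = rng.standard_normal((8, 3, 3)).astype(np.float32)    # depthwise 3x3 kernel
assert np.allclose(unpool_then_depthwise(x, idx, w),
                   fused_depthwise_unpool(x, idx, w), atol=1e-4)
```

The practical appeal of such a fusion is that the large, mostly zero unpooled tensor never has to be materialized: only the original input vectors, the convolution kernel, and the pooling indices need to be kept in on-chip storage.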
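The broadcast-and-distribute dataflow of claim 6 can likewise be illustrated with a small software model. In the toy function below, the function name, the round-robin distribution policy, and the lane count are assumptions for illustration only: depth-direction input vectors are spread over a bank of simulated vector multipliers, a single depth-direction weight vector is broadcast to all of them, and each multiplier returns the element-wise products for the input vectors it received.

```python
import numpy as np

def broadcast_distribute_multiply(input_vectors, weight_vector, num_multipliers=4):
    """Toy model of the claim-6 dataflow: distribute input vectors over a bank of
    vector multipliers, broadcast one weight vector to all of them, and collect
    the element-wise products (the result vectors) in input order."""
    lanes = [[] for _ in range(num_multipliers)]
    for n, vec in enumerate(input_vectors):
        lanes[n % num_multipliers].append((n, vec))      # distribute: round-robin assignment
    results = {}
    for lane in lanes:                                    # each lane models one vector multiplier
        for n, vec in lane:
            results[n] = vec * weight_vector              # broadcast weight, element-wise multiply
    return [results[n] for n in range(len(input_vectors))]

C = 8                                                     # depth (channel) dimension
vecs = [np.arange(C, dtype=np.float32) + n for n in range(6)]
wvec = np.full(C, 0.5, dtype=np.float32)
out = broadcast_distribute_multiply(vecs, wvec)
print(out[0])  # first result vector: 0.5 * [0, 1, ..., 7]
```

On real hardware the distribution and broadcast are wiring and scheduling decisions rather than Python loops; the only point of the model is the data-movement pattern, in which many input vectors fan out across the multipliers while one weight vector fans out to all of them.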
CN202110721919.XA 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model Active CN113469333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110721919.XA CN113469333B (en) 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model

Publications (2)

Publication Number Publication Date
CN113469333A CN113469333A (en) 2021-10-01
CN113469333B (en) 2023-11-10

Family

ID=77873407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110721919.XA Active CN113469333B (en) 2021-06-28 2021-06-28 Artificial intelligence processor, method and related products for executing neural network model

Country Status (1)

Country Link
CN (1) CN113469333B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118586454A (en) * 2024-08-06 2024-09-03 深圳鲲云信息科技有限公司 Method and computing device for computing artificial intelligent chip

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364061A (en) * 2018-02-13 2018-08-03 北京旷视科技有限公司 Arithmetic unit, operation execute equipment and operation executes method
CN110570431A (en) * 2019-09-18 2019-12-13 东北大学 Medical image segmentation method based on improved convolutional neural network
CN110858398A (en) * 2018-08-24 2020-03-03 深圳市前海安测信息技术有限公司 Tongue segmentation device and method based on deep learning and storage medium
WO2020051776A1 (en) * 2018-09-11 2020-03-19 Intel Corporation Method and system of deep supervision object detection for reducing resource usage
CN111160545A (en) * 2019-12-31 2020-05-15 北京三快在线科技有限公司 Artificial neural network processing system and data processing method thereof
KR102145374B1 (en) * 2020-01-17 2020-08-18 성균관대학교산학협력단 Artificial intelligence neural network apparatus coupled with visualized feature vector and data classification method using the same
CN111898419A (en) * 2020-06-17 2020-11-06 西安交通大学 Partition landslide detection system and method based on cascade deep convolutional neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11023360B2 (en) * 2018-11-14 2021-06-01 The Mathworks, Inc. Systems and methods for configuring programmable logic devices for deep learning networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Remote sensing object detection and feature extraction based on deep neural networks; Wang Gang; Chen Jinyong; Gao Feng; Wu Jinliang; Radio Engineering, No. 9; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant