CN116611488A - Vector processing unit, neural network processor and depth camera

Vector processing unit, neural network processor and depth camera

Info

Publication number
CN116611488A
Authority
CN
China
Prior art keywords
data
processing unit
operator
vector processing
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310598167.1A
Other languages
Chinese (zh)
Inventor
支元祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orbbec Inc
Original Assignee
Orbbec Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orbbec Inc filed Critical Orbbec Inc
Priority to CN202310598167.1A
Publication of CN116611488A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application is applicable to the technical field of chips, and relates to a vector processing unit for a neural network processor, a neural network processor and a depth camera. The vector processing unit includes a precision-raising module, an inverse quantization module, an operation module, a quantization module and a precision-reducing module. The precision-raising module is used for raising input first integer data to first floating-point data; the inverse quantization module is used for inversely quantizing the first floating-point data to obtain inverse-quantized data, the inverse-quantized data being floating-point data; the operation module is used for operating on the inverse-quantized data to obtain second floating-point data; the quantization module is used for quantizing the second floating-point data to obtain quantized data, the quantized data being floating-point data; the precision-reducing module is used for lowering the precision of the quantized data to obtain second integer data. Because the operations inside the vector processing unit are performed on floating-point data, the algorithm precision of the neural network processor is guaranteed.

Description

Vector processing unit, neural network processor and depth camera
Technical Field
The application belongs to the technical field of chips, and particularly relates to a vector processing unit (Vector Processing Unit, VPU), a neural network processor and a depth camera.
Background
Convolutional neural networks are computing systems based on biological neural networks and, as a product of the intersection of biological science and computer science, are widely used in fields such as machine learning and pattern recognition. In order to solve increasingly complex and abstract learning problems, the scale of convolutional neural networks keeps growing; some large-scale neural networks, such as the Google Cat system network, have about one billion neuron connections, and the data volume and computational complexity grow accordingly. Therefore, how to design a high-performance, low-power hardware acceleration architecture for convolutional neural network forward inference within a limited area, such as a neural network processor (Neural-Network Processing Unit, NPU), is a current research hotspot.
At present, in order to greatly reduce area and power consumption, the internal computing units of many neural network processors use integer data or custom floating-point data; however, this approach causes a considerable loss of algorithm accuracy.
Disclosure of Invention
In view of the above, embodiments of the present application provide a vector processing unit, a neural network processor and a depth camera, which can solve the technical problem in the related art that the algorithm precision of a neural network processor is lost.
In a first aspect, an embodiment of the present application provides a vector processing unit for a neural network processor, where the vector processing unit includes a precision-raising module, an inverse quantization module, an operation module, a quantization module and a precision-reducing module. The precision-raising module is used for raising input first integer data to first floating-point data; the inverse quantization module is used for inversely quantizing the first floating-point data to obtain inverse-quantized data, the inverse-quantized data being floating-point data; the operation module is used for operating on the inverse-quantized data to obtain second floating-point data; the quantization module is used for quantizing the second floating-point data to obtain quantized data; and the precision-reducing module is used for lowering the precision of the quantized data to obtain second integer data.
In a second aspect, an embodiment of the present application provides a neural network processor, including the vector processing unit according to the embodiment of the first aspect, a first input buffer unit, a convolution unit and a second input buffer unit, where the convolution unit is configured to perform convolution processing on input data. The first input buffer unit transmits input data to the vector processing unit through a first path or a second path. In the first path, the first input buffer unit transmits the input data to the convolution unit, the convolution unit outputs the convolved data to the second input buffer unit, and the second input buffer unit inputs the data to the vector processing unit; the first integer data is the data obtained after convolution by the convolution unit. In the second path, the first input buffer unit transmits the input data directly to the vector processing unit, and the first integer data is the input data.
In a third aspect, an embodiment of the present application provides a depth camera, including a neural network processor according to an embodiment of the second aspect.
With the vector processing unit described above, the first integer data is raised to first floating-point data, and inverse quantization, operation, quantization and precision reduction are then performed in sequence to obtain the second integer data. Because the operations inside the vector processing unit are performed on floating-point data, the algorithm precision, and hence the precision of the neural network processor, is guaranteed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a neural network processor architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an overall architecture of a vector processing unit according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a data path of a first path entering a vector processing unit in a neural network processor according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a sigmoid function curve and its second derivative curve according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a sigmoid function provided by an embodiment of the present application;
FIG. 5B is a schematic diagram of a fitted curve of a sigmoid function provided by an embodiment of the present application;
FIG. 5C is a schematic diagram showing a sigmoid function curve and a sigmoid function fitting curve according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an operation principle of an on-chip memory module according to an embodiment of the present application;
FIG. 7A is a schematic diagram illustrating an address space allocation of an on-chip memory module according to some embodiments of the present application;
FIG. 7B is a schematic diagram illustrating an address space allocation of an on-chip memory module according to other embodiments of the present application;
FIG. 7C is a schematic diagram of an address space allocation of an on-chip memory module according to still further embodiments of the present application;
FIG. 8A is a schematic diagram of address allocation within each group of data in full OCH mode according to some embodiments of the present application;
FIG. 8B is a schematic diagram illustrating the address allocation within each group of data in a 2-fold resolution mode according to other embodiments of the present application;
FIG. 8C is a schematic diagram of address allocation within each group of data in a 4-fold resolution mode provided by further embodiments of the present application;
FIG. 9A is a schematic diagram of address allocation for an on-chip memory module according to some embodiments of the present application;
FIG. 9B is a schematic diagram illustrating address allocation of an on-chip memory module according to other embodiments of the present application;
FIG. 9C is a schematic diagram of address allocation for an on-chip memory module according to still further embodiments of the present application;
FIG. 10 is a schematic diagram of OCH data storage for each piece of data in 4-fold resolution mode according to an embodiment of the present application;
FIG. 11 is a schematic diagram showing the splicing of slice data in 4-fold resolution mode according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a data path of a second path entering a vector processing unit in a neural network processor according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Furthermore, in the description of the present application, the meaning of "plurality" is two or more. The terms "first" and "second" and the like are used solely to distinguish one from another and are not to be construed as indicating or implying a relative importance.
It will be further understood that the term "coupled" is to be interpreted broadly unless explicitly stated or defined otherwise; for example, it may be a fixed connection, a detachable connection or an integral connection; it may be a direct connection or an indirect connection through an intermediate medium; and it may be an internal communication between two elements or an interaction between two elements. The specific meaning of the above term in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In order to better explain the technical scheme of the application, the technical scheme of the application is explained and illustrated in detail below by combining some specific parameters. It should be understood that these parameters are merely preferred for ease of illustration and are not to be construed as specific limitations of the present application.
The application provides a neural network processor that can be applied to electronic devices such as depth cameras, mobile phones, tablet computers, smart door locks and payment devices, and can be mounted as a coprocessor on a central processing unit, with the central processing unit assigning tasks to the neural network processor.
Fig. 1 is a schematic architecture diagram of a neural network processor according to the present application. The neural network processor includes a bus, a first input buffer unit 101, a convolution unit 102, a second input buffer unit 103, a vector processing unit 104 and an output buffer unit 105. The neural network processor exchanges data with devices such as a central processing unit through the bus, which also realizes data interaction among the internal units. The first input buffer unit 101 is configured to store data input from the bus (for example, data output by a previous processing unit), and its output is connected to the convolution unit 102 and the vector processing unit 104. The data output by the first input buffer unit 101 may enter the vector processing unit 104 via two paths, namely a first path (the path indicated by the dotted line in fig. 1) and a second path (the path indicated by the solid line in fig. 1). Under the first path, the data output by the first input buffer unit 101 first enters the convolution unit 102, the convolution unit 102 performs convolution processing on the input data and outputs the convolution result to the second input buffer unit 103, and the second input buffer unit 103 outputs the convolution result to the vector processing unit 104. Under the second path, the data output by the first input buffer unit 101 is input directly to the vector processing unit 104. For data entering the vector processing unit 104 via the two different paths, the processing performed by the vector processing unit 104 differs.
It should be noted that, only a portion of the neural network processor relevant to the present application is shown in fig. 1, and the neural network processor may further include other processing units, etc., which are not limited herein.
In the neural network processor, the vector processing unit 104 is used to process operators other than convolution and to store and transmit data. Specifically, the vector processing unit 104 is configured to perform processing such as precision raising, inverse quantization, activation function (ACT) operator operation, pooling operator operation, global pooling (Global-Pool) operator operation, Eltwise operator operation, Resize operator operation, binocular operator (SGM) operation, quantization and precision lowering on the input data, and to write the operation results into the output buffer unit 105 according to the data arrangement format required by the next processing unit connected to the vector processing unit in the neural network processor. The output buffer unit 105 may be a double data rate synchronous dynamic random access memory (DDR SDRAM); it is an off-chip memory located outside the vector processing unit 104 and is configured to buffer the data output by the vector processing unit 104 for input to the next processing unit.
Fig. 2 is a block diagram of a vector processing unit according to some embodiments of the present application. The vector processing unit includes a precision-raising module, a precision-reducing module, a quantization module, an inverse quantization module and an operation module. The input data of the vector processing unit may come from the second input buffer unit under the first path, or directly from the first input buffer unit under the second path. The data entering the vector processing unit under both paths is first integer data; the difference is that the data entering under the first path has been processed by the convolution unit. The precision-raising module raises the first integer data to first floating-point data, the inverse quantization module inversely quantizes the first floating-point data to obtain inverse-quantized data, the operation module operates on the inverse-quantized data to obtain second floating-point data, the quantization module quantizes the second floating-point data to obtain quantized data, and the precision-reducing module lowers the precision of the quantized data to obtain second integer data. In this way, all data processing inside the vector processing unit is based on floating-point data, which guarantees the calculation accuracy.
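For illustration only, the following Python sketch models the five-stage data flow just described (precision raising, inverse quantization, operation, quantization, precision lowering); the scale values, the ReLU-like placeholder operator, the int8 output range and the use of NumPy are assumptions made for this sketch and are not part of the described hardware.

```python
import numpy as np

def vpu_pipeline(first_int_data, scale_in, scale_out, op=lambda x: np.maximum(x, 0.0)):
    """Illustrative model of the vector processing unit data flow (not the hardware).

    first_int_data : integer array (e.g. int8/int16/int32/int48 values)
    scale_in       : inverse quantization parameter for the incoming data
    scale_out      : quantization parameter required by the next processing unit
    op             : placeholder for the operation stage (default: a ReLU-like op)
    """
    # 1. Precision raising: integer -> floating point (fp32 assumed here).
    fp_data = first_int_data.astype(np.float32)
    # 2. Inverse quantization: multiply by the inverse quantization parameter.
    dq_data = fp_data * np.float32(scale_in)
    # 3. Operation stage, performed entirely in floating point.
    result_fp = op(dq_data).astype(np.float32)
    # 4. Quantization for the next processing unit (result is still floating point).
    q_data = result_fp * np.float32(1.0 / scale_out)
    # 5. Precision lowering: round to nearest even and saturate to int8 (assumed target type).
    return np.clip(np.rint(q_data), -128, 127).astype(np.int8)

# Example: three convolution results pass through the unit.
print(vpu_pipeline(np.array([-1000, 0, 2500], dtype=np.int32), 0.01, 0.1))  # [  0   0 127]
```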
In some embodiments, the data type output by the first input buffer unit is int8 or int16. Since the calculation result of the convolution unit may exceed the range of int8 or int16, and overflow would cause information loss, the convolution unit may expand the data input by the first input buffer unit to avoid overflow, the expanded representation range being larger than that of the original data type: for example, int8 data is expanded to int32 after convolution, and int16 data is expanded to int48 after convolution. Therefore, the type of the first integer data input to the vector processing unit under the first path is int32 or int48, while under the second path it is int8 or int16. In addition, the data after precision raising may exceed the range that the original type can represent, which would again cause overflow and information loss; in this embodiment, the precision-raising module is therefore further configured to judge the data type of the first integer data, and data of a non-preset type is expanded to accommodate more data. Assuming that int48 is the preset type, when the first integer data input to the vector processing unit is of type int8, int16 or int32, the precision-raising module is further configured to bit-expand the first integer data, expanding int8, int16 and int32 data to int48.
Specifically, when the first integer data is of type int8, int16 or int32, the precision-raising module supplements sign bits before the most significant bit of the first integer data until it is expanded to 48 bits, and then raises the precision of the first integer data, converting the int48 into type fp32. More specifically, the sign bit (most significant bit) of the int48 is taken as the sign bit of the fp32; the complement of the int48 is obtained, turning the signed int48 into an unsigned uint48; the position of the leading 1 of the uint48 is searched from the most significant bit downward to obtain the exponent bits and fraction bits; and rounding, for example round-to-nearest-even rounding, is performed to obtain the first floating-point data. The round-to-nearest-even mode eliminates the average error introduced by always rounding up or always rounding down, improving the operation precision of the vector processing unit.
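As a software illustration of the precision-raising steps just described (sign-bit extraction, complement, leading-1 search, exponent/fraction formation and round-to-nearest-even), the following Python sketch converts a signed 48-bit integer to fp32; Python arbitrary-precision integers stand in for the 48-bit hardware registers, and the helper name int48_to_fp32 is made up for this example.

```python
import struct

def int48_to_fp32(value):
    """Convert a signed 48-bit integer to IEEE-754 fp32 with round-to-nearest-even."""
    assert -(1 << 47) <= value < (1 << 47)
    if value == 0:
        return 0.0
    sign = 1 if value < 0 else 0
    mag = -value if sign else value              # complement: signed int48 -> unsigned magnitude

    msb = mag.bit_length() - 1                   # position of the leading 1
    exponent = msb + 127                         # biased fp32 exponent

    if msb > 23:                                 # more precision than fp32 can hold: round
        shift = msb - 23
        mantissa = mag >> shift
        remainder = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        # Round to nearest, ties to even.
        if remainder > half or (remainder == half and (mantissa & 1)):
            mantissa += 1
            if mantissa == (1 << 24):            # rounding carried into the next power of two
                mantissa >>= 1
                exponent += 1
    else:
        mantissa = mag << (23 - msb)

    bits = (sign << 31) | (exponent << 23) | (mantissa & 0x7FFFFF)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

# Sanity check against the round-to-nearest-even conversion done by struct itself.
for v in (-3, 123456789012345, (1 << 47) - 1):
    assert int48_to_fp32(v) == struct.unpack('<f', struct.pack('<f', float(v)))[0]
```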
After the first integer data has been raised in precision, the first floating-point data is obtained, and the next operation and processing can be performed after it is inversely quantized by the inverse quantization module. In some embodiments, the inverse quantization module includes a floating-point multiplier that multiplies the first floating-point data by a given inverse quantization parameter to obtain the inverse-quantized data, which remains floating-point data. Under the first path, the first integer data is the result of the first input buffer unit transmitting the input data to the convolution unit and the convolution unit convolving it, so the inverse quantization parameter (scale) under the first path is the product of the inverse quantization parameter required by the input data (scale_feature) and the inverse quantization parameter of the convolution unit's weights (scale_kernel), that is, scale = scale_feature × scale_kernel. Under the second path, the first integer data is the input data transmitted directly by the first input buffer unit, so the inverse quantization parameter under the second path is the inverse quantization parameter scale_feature required by the input data.
As shown in fig. 2, the operation module includes an adder for adding data, a multiplier for multiplying data, and a comparator for comparing data to obtain the maximum of several values. In the embodiment shown in FIG. 3, the multiplier, adder and comparator are each illustrated as single-precision floating-point (fp32) operators. In other embodiments, half-precision floating-point (fp16 or bfp16) multiply, add and compare operators may be used, or custom floating-point multiply, add and compare operators may be used. By combining at least one of the adder, the multiplier and the comparator, the operation module can realize at least one of the activation function operator operation, the Eltwise operator operation, the Resize operator operation, the pooling operator operation, the global pooling operator operation and the binocular operator operation.
In some embodiments, the vector processing unit further includes an on-chip storage module (Random Access Memory, RAM) for storing the first integer data. Further, the on-chip storage module is also configured to write the second integer data into the output buffer unit according to the data type and arrangement format required by the next processing unit connected to the vector processing unit, so that the next processing unit can conveniently access the second integer data.
The embodiments shown in FIGS. 3 to 11 illustrate the operation principle of the vector processing unit under the first path, which is described below.
FIG. 3 is a schematic diagram of the data path of the vector processing unit under the first path, in which the processing of the operation module is illustrated with the activation function operator operation as an example.
The inverse-quantized data is operated on by the activation function operator. The vector processing unit may support two broad classes of activation functions: RELU-related activation functions, and activation functions with an exponential form (e.g., sigmoid).
Activation functions of the first type: when implementing a RELU-related activation function, the data range of the input data x is first judged; different ranges correspond to different slopes k, and x and the corresponding k are then used as the two inputs of a floating-point multiplier to realize the activation.
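A minimal sketch of such a range-dependent-slope activation is shown below; the two ranges and the 0.1 negative slope (i.e. a leaky-ReLU-like curve) are example values only, since the text does not fix particular ranges or slopes.

```python
import numpy as np

def relu_family(x, slopes=((float('-inf'), 0.0, 0.1), (0.0, float('inf'), 1.0))):
    """Range-dependent-slope activation: the range containing x selects a slope k,
    and the output is x * k. The ranges/slopes here are illustrative examples."""
    y = np.empty_like(x, dtype=np.float32)
    for lo, hi, k in slopes:
        mask = (x >= lo) & (x < hi)
        y[mask] = x[mask] * np.float32(k)
    return y

print(relu_family(np.array([-2.0, 0.5, 3.0], dtype=np.float32)))  # [-0.2  0.5  3. ]
```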
Activation functions of the second type: activation functions with an exponential form. Four methods are mainly used to implement exponential-form functions in hardware. The first is a lookup table, whose storage requirement grows significantly as the calculation accuracy increases, leading to high resource consumption. The second is Taylor-series expansion, which requires far more multipliers and adders when high accuracy is demanded. The third is the CORDIC algorithm (coordinate rotation digital computer), whose number of iterations grows with the required precision, reducing the computation speed. The fourth is fitting with the least-squares method.
Besides supporting the two classes of activation functions above, the vector processing unit also improves the sigmoid function in the second class. In some embodiments, the activation function operation is implemented by a sigmoid function obtained with a non-uniform piecewise polynomial fitting method, which requires only a small number of multipliers and adders. In one embodiment, a first-order polynomial fit of the sigmoid function is performed as follows:
sigmoid(x) ≈ a·x + b + O(x²)
Specifically, the sigmoid function may be approximated as a first-order function of x plus an error term of second order and above, where a represents the slope of the sigmoid function, b represents its intercept at x = 0, and O(x²) is the error term, which is negligible when x approaches 0. This approximation therefore holds only when x approaches 0; when x is large, the error becomes large.
To further improve the accuracy of the first-order polynomial fit of the sigmoid function, this embodiment performs a non-uniform piecewise computation of the first-order polynomial. Specifically, the sigmoid function is Taylor-expanded, and its second derivative determines the accuracy. As shown in FIG. 4, the upper curve is the sigmoid function and the lower curve diff2 is its second derivative; the larger the absolute value of the second derivative, the larger the fitting error. Therefore, the density of the piecewise intervals should be proportional to the absolute value of the second derivative: the larger the absolute value of the second derivative in a region, the finer the piecewise intervals; the smaller the absolute value, the sparser the intervals. As can be seen from FIG. 4, the regions where the absolute value of the second derivative is large are concentrated in [-2, -0.5] and [0.5, 2], so the segmentation of these regions is denser.
Since the sigmoid function is symmetric about the point (0, 0.5), only points with x > 0 are fitted in order to reduce the number of piecewise intervals, and points with x < 0 are calculated according to the symmetry. The fitted sigmoid expression is as follows:
sigmoid(x) ≈ a·x + b,        x ≥ 0
sigmoid(x) ≈ a·x + (1 − b),  x < 0
where a and b are looked up according to the piecewise interval in which |x| falls.
in order to accelerate the calculation speed, a lookup table can be constructed by combining a, b and 1-b with different x values, and then, table lookup is only needed to be carried out on three values of a, b and 1-b according to the x values in the segmentation interval, so that sigmoid function operation is realized. Because the input of the sigmoid function is floating point data, the segmentation interval for fitting the sigmoid function is also needed to be segmented according to the floating point data, thereby facilitating the table lookup.
In this embodiment, the sigmoid activation function is obtained by a non-uniform piecewise first-order polynomial fitting method, which overcomes the drawbacks of conventional hardware implementations of exponential functions, such as a large number of iterations, low computation speed, low precision and high resource overhead; moreover, the piecewise intervals are adjustable, so the precision is controllable. Further, FIGS. 5A to 5C compare the true sigmoid function with the fitted one: FIG. 5A shows the sigmoid function curve sigmoid(x), FIG. 5B shows the fitted curve polyfit(x), and FIG. 5C shows the two curves overlaid. It can be seen that implementing the sigmoid function with the non-uniform piecewise polynomial fitting method yields a small fitting error, which in one embodiment is less than 5×10⁻⁵.
In other embodiments of the present application, the sigmoid fitting function may also reduce the number of piecewise intervals by using a higher-order fit, which requires more multipliers and adders than the first-order fit; the choice can be made according to the available number of multipliers and adders.
In other embodiments, there may be no activation function after convolution in the network, in which case the activation function can be skipped (bypass ACT). A bias is typically added at this point; if there is no bias, 0 is added, and the result is then output to the quantization module.
The data after the activation function operator, or the bias-processed data, is still floating-point data. It is quantized by the quantization module according to the quantization parameter of the next processing unit in the neural network processor, its precision is then lowered by the precision-reducing module according to the data type required by the next processing unit, thereby realizing the data type conversion, and the result is buffered by the on-chip storage module. In one embodiment, the quantization module includes a floating-point multiplier.
The description here assumes that the data type required by the next processing unit connected to the vector processing unit in the neural network processor is int8 or int16. The precision-reducing module is specifically configured to: separate the sign bit, exponent bits and fraction bits; shift the fraction bits according to the exponent to obtain an unsigned integer; round the unsigned number to obtain a uint8 or uint16; and take the complement of the uint8 or uint16 according to the sign bit to obtain the signed int8 or int16.
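The following Python sketch mirrors these precision-lowering steps (field separation, shifting, round-to-nearest-even, sign re-application); the saturation at the int8/int16 limits is an added assumption, since the text does not state how out-of-range values are handled.

```python
import struct

def fp32_to_int(value, bits=8):
    """Lower an fp32 value to a signed int8/int16: split sign, exponent and fraction,
    shift the fraction into an unsigned integer, round to nearest-even, re-apply the
    sign. Saturation at the target range limits is an assumption added here."""
    raw = struct.unpack('<I', struct.pack('<f', float(value)))[0]
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF

    if exponent == 0:                      # zero / subnormal: rounds to 0 at these widths
        return 0
    mantissa = (1 << 23) | fraction        # restore the implicit leading 1
    shift = 23 - (exponent - 127)          # how far the binary point sits below bit 23

    if shift <= 0:
        mag = mantissa << -shift
    else:
        mag = mantissa >> shift
        rem = mantissa & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and (mag & 1)):
            mag += 1                       # round to nearest, ties to even

    result = -mag if sign else mag
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, result))

print([fp32_to_int(v) for v in (-3.6, 0.5, 1.5, 127.7, 300.0)])  # [-4, 0, 2, 127, 127]
```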
Under the first path, the on-chip storage module is mainly used to cache the operation results and to write them back to the output buffer unit according to the data type and data arrangement format required by the next processing unit. The on-chip storage module supports input and output data of type int8 or int16. To meet the requirements of the next processing unit, the on-chip storage module can arrange the output data as NC4HW, NC8HW, NC16HW and the like, and can output in full OCH (Output Channel) mode, 2-fold resolution parallel mode or 4-fold resolution parallel mode.
Specifically, the full OCH mode means that the cached operation results are all transmitted to the next processing unit for processing; the 2-fold resolution parallel mode means that the cached operation results are divided into two parts and transmitted to different processing units for processing; and the 4-fold resolution parallel mode means that the operation results are divided into four parts and transmitted to different processing units for processing.
In some embodiments, the on-chip memory module may include a plurality of memory subunits, and memory units are obtained by combining the memory subunits, so that the on-chip memory module stores as many operation results as possible and improves the read/write bandwidth. The on-chip memory module includes 32 memory subunits, each with a memory space of 32 bits × 1024 in depth. Under the first path, the vector processing unit processes 16 rows by 1 column of data per cycle; as shown in FIG. 6, the on-chip memory module merges the 32 memory subunits in pairs and uses them as 16 memory units of 64 bits × 1024 in depth, each memory unit caching 1 row of data. The address space allocation of the 16 memory units is identical, so one memory unit is taken as an example.
Specifically, assuming that a memory unit of depth 1024 performs ping-pong reading and writing at a depth of 512, the input and output data types determine how many output channels of data can be stored in a memory space of 64 bits × 512 in depth. In some embodiments, to improve computing efficiency, reduce memory usage and lower the data transmission requirement, the data stored in the memory unit is divided into smaller blocks for processing; different input/output data types mean different block sizes, and therefore different maximum numbers of OCH that the memory unit can store. For example, when the input data type is int8, each block is assumed to be 16 (rows) × 16 (width), that is, the data is divided into blocks of 16 rows and 16 columns for processing. When the input data type is int16, each block is 16 (rows) × 8 (width), i.e. the input data is divided into blocks of 16 rows and 8 columns for processing.
The maximum number of OCH that the memory unit can store is determined by the size of the memory space, the block width determined by the input data type, and the bit width of the output data. In some embodiments, denoting the maximum number of OCH as och_num, there are four combinations of input and output data types. With a memory space of 64 bits × 512 in depth, och_num = (64 × 512) / (W × B), where W is the block width (16 for int8 input, 8 for int16 input) and B is the bit width of the output data (8 for int8, 16 for int16). Thus: for int8 input and int8 output, och_num = (64 × 512)/(16 × 8) = 256; for int8 input and int16 output, och_num = (64 × 512)/(16 × 16) = 128; for int16 input and int8 output, och_num = (64 × 512)/(8 × 8) = 512; and for int16 input and int16 output, och_num = (64 × 512)/(8 × 16) = 256.
In one embodiment, assuming that the convolution unit produces at most 128 output channels of data, 128 OCH of data are defined as one group of data. Accordingly, the maximum number of data groups that each memory unit of depth 512 can store, and the address depth occupied by each group of data, are shown in Table 1 below.
TABLE 1
Input type   Output type   Max OCH (och_num)   Data groups   Address depth per group
int8         int8          256                 2             256
int8         int16         128                 1             512
int16        int8          512                 4             128
int16        int16         256                 2             256
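A short sketch of this capacity calculation is given below; it assumes the 64-bit × 512-deep ping-pong half, the block widths of 16 (int8 input) and 8 (int16 input), and groups of 128 OCH stated above, and reproduces the values given in Table 1 under those assumptions.

```python
# Assumes: 64-bit x 512-deep ping-pong half per memory unit, block width 16 (int8 in)
# or 8 (int16 in), output element width 8 or 16 bits, and 128 OCH per data group.
MEM_BITS = 64 * 512

def och_capacity(in_type, out_type):
    width = 16 if in_type == 'int8' else 8          # block width in pixels per row
    out_bits = 8 if out_type == 'int8' else 16
    och_num = MEM_BITS // (width * out_bits)        # max output channels per memory unit
    groups = och_num // 128                         # data groups of 128 OCH
    depth_per_group = 512 // groups                 # address depth occupied by each group
    return och_num, groups, depth_per_group

for i in ('int8', 'int16'):
    for o in ('int8', 'int16'):
        print(i, o, och_capacity(i, o))
# int8 int8 (256, 2, 256)
# int8 int16 (128, 1, 512)
# int16 int8 (512, 4, 128)
# int16 int16 (256, 2, 256)
```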
Fig. 7A, 7B and 7C are schematic diagrams showing address space allocation of memory cells corresponding to the above four combinations of input data types and output data types. Wherein, fig. 7A corresponds to two cases of int8 (in) and int8 (out), int16 (in) and int16 (out); fig. 7B corresponds to int8 (in) and int16 (out); fig. 7C corresponds to int16 (in) and int8 (out); in and out represent input and output, respectively.
Each group of data has three different resolution working modes, namely full OCH mode, 2-fold resolution mode and 4-fold resolution mode. In the different modes, the number of slices into which each group is divided and the number of OCH contained in each slice are shown in Table 2 below.
TABLE 2
Working mode         Number of slices   OCH per slice
Full OCH             1                  128
2-fold resolution    2                  64
4-fold resolution    4                  32
The address assignment within each group of data in full OCH mode, 2-fold resolution mode and 4-fold resolution mode is shown in FIGS. 8A, 8B and 8C, respectively. In full OCH mode, OCH0–OCH127 are allocated in a single slice of data. In 2-fold resolution mode, OCH0–OCH63 are stored in one slice and OCH64–OCH127 in the other. In 4-fold resolution mode, OCH0–OCH31 are stored in the first slice, OCH32–OCH63 in the second, OCH64–OCH95 in the third and OCH96–OCH127 in the fourth.
The input data type, the output data type and the output data arrangement format determine the address allocation within each slice of data. In some embodiments, the output data arrangement format may be NC4HW, NC8HW or NC16HW. NC4HW, NC8HW and NC16HW are optimizations of the NCHW format, meaning that the data of 4, 8 or 16 consecutive output channels are grouped together and each group is stored in NCHW format, where N represents the number of output data, C the number of output channels, H the height of the output data and W the width of the output data. For example, when the data arrangement format output by the on-chip memory module is NC4HW, the data is divided into several C4W groups, each C4W representing a block of data that contains 4 output channels and has width W. Thus, when the output data type is int8, one C4W group occupies 32 bits (4 bytes) and two C4W groups (C4W2) form 64 bits; when the output data type is int16, one C4W group occupies 64 bits (8 bytes).
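The following Python sketch shows how an NCHW tensor can be regrouped into the NC4HW arrangement described above; padding the channel count up to a multiple of 4 is an assumption added for the example.

```python
import numpy as np

def nchw_to_nc4hw(x):
    """Regroup an NCHW tensor into NC4HW: every 4 consecutive output channels form
    one group, and each group is stored contiguously in NCHW order. Zero-padding the
    channel count to a multiple of 4 is an assumption of this sketch."""
    n, c, h, w = x.shape
    c4 = (c + 3) // 4
    padded = np.zeros((n, c4 * 4, h, w), dtype=x.dtype)
    padded[:, :c] = x
    # Shape (N, C/4, 4, H, W): index [n, g, k, y, x] is channel 4*g + k.
    return padded.reshape(n, c4, 4, h, w)

x = np.arange(1 * 6 * 2 * 3, dtype=np.int8).reshape(1, 6, 2, 3)
y = nchw_to_nc4hw(x)
print(y.shape)                          # (1, 2, 4, 2, 3): two C4 groups, second one padded
assert (y[0, 1, 0] == x[0, 4]).all()    # group 1, sub-channel 0 is original channel 4
```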
Taking the allocation of 64-bit data as an example, assume the number of OCH in each slice is 4n and every 4 OCH form one C4; each slice then contains n C4 groups, numbered C4(0), C4(1), C4(2), …, C4(n−1) in sequence. FIG. 9A shows the on-chip address allocation when the output data type is int8 and the output arrangement format is NC4HW; FIG. 9B shows the allocation when the output data type is int8 and the output arrangement format is NC8HW or NC16HW; FIG. 9C shows the allocation when the output data type is int16 and the output arrangement format is NC4HW, NC8HW or NC16HW. Each C4W (or C4W(even), C4W(odd)) is arranged in the order of W, where C4W(even) and C4W(odd) denote the even-numbered and odd-numbered C4 groups, respectively.
The read-address calculation of the memory unit is determined by the input data type, the output data type, the output data arrangement format and the working mode. The input data type determines the output data width W; the output data type and arrangement format determine whether C4W, C4W2 or C8W is read; and the different working modes determine the order in which OCH data are read. The slice data are read out in order of increasing OCH. In full OCH mode, regardless of whether the output data are arranged as NC4HW, NC8HW or NC16HW, the data are output in order of increasing OCH; an increasing OCH does not mean an increasing read address, and the read address is calculated according to the way the OCH data are stored in the on-chip memory module.
In one embodiment, the read-out order of OCH data between slices is illustrated in the 4-fold resolution mode; the 2-fold resolution mode can be deduced by analogy. In the 4-fold resolution mode, each group of data is divided into 4 slices, namely slice 0, slice 1, slice 2 and slice 3, and the 4 slices are output in sequence. As shown in FIG. 10, rp denotes a slice; for ease of understanding, the OCH data within each slice are grouped and numbered according to C4, C8 and C16, and the read-out order of each slice is shown in FIG. 11, where X may be 4, 8 or 16. To ensure the continuity of the convolution resolution in the next layer of processing units, the same groups of different slices are spliced in slice order, and the different groups are then spliced in group order. A group of data can also be understood as a slice of data, and ping-pong reading and writing can be understood as supporting 1-group, 2-group and 4-group data parallelism. The addressing within each group can refer to the addressing within a slice, and the addressing between groups can refer to the addressing between slices, which is not repeated here.
The embodiment shown in fig. 12 illustrates the operation principle of the vector processing unit under the second path. The first integer data input to the vector processing unit under the second path comes from the first input buffer unit, and the vector processing unit under the second path can operate on the input data with various operators, including but not limited to the Eltwise operator, the pooling operator, the global pooling operator, the Resize operator, the activation function operator and the binocular operator. These operators can multiplex or share the same computing resources (adders, multipliers and comparators) and storage resources; that is, the adders, multipliers and comparators used in the Eltwise, pooling, global pooling, Resize, activation function and binocular operator operations are at least partially the same, which reduces the area occupied by the vector processing unit. The vector processing unit can also incorporate other operators while making reasonable use of the reusable resources, giving it good functional scalability. In some embodiments, if the data entering the Eltwise operator carries a coefficient, the coefficient is taken into account in the inverse quantization parameter.
In one embodiment, the Eltwise operator is used to perform elementwise multiplication, addition/subtraction and maximum operations on corresponding data of input data from different channels. The vector processing unit splits the input data according to the input data type, performs multiplication, addition, maximum and similar operations in parallel on several pixel data of channel a and the corresponding pixel data of channel b, and then writes the output data into the output buffer unit according to the operation result and the output data type.
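A minimal sketch of such an Eltwise operation in floating point is shown below; the operation names are taken from the text, while the function name and test data are made up for the example.

```python
import numpy as np

def eltwise(a, b, mode='add'):
    """Eltwise operator sketch: elementwise product, sum/difference or maximum of
    the dequantized (floating-point) data of two channels a and b."""
    ops = {'mul': np.multiply, 'add': np.add, 'sub': np.subtract, 'max': np.maximum}
    return ops[mode](a.astype(np.float32), b.astype(np.float32))

a = np.array([1.0, -2.0, 3.0], dtype=np.float32)
b = np.array([0.5,  4.0, -1.0], dtype=np.float32)
print(eltwise(a, b, 'max'))    # [1. 4. 3.]
```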
The pooling modes adopted by the global pooling operator include maximum pooling and average pooling; the pooling window has a variable size and is used to reduce the input data from w × h to 1 × 1. The principle of the global pooling operator is explained taking the parallel input of data from several columns of each channel as an example. First, the maximum of, or the sum over, the data in these columns is taken. To improve the acceleration performance, the vector processing unit further includes a register; the maximum or sum result is written into the register, and an initial value is defined for the global pooling operation: in maximum mode the initial value is the floating-point representation of negative infinity, and in average mode it is 0. When the register is empty, the result of the first step is accumulated with, or compared to, the initial value, and the result is written into the register; when the register is not empty, the result of the first step is accumulated with, or compared to, the data in the register, and the result is written back into the register. This loops until the last data of the input. Because a floating-point adder or comparator takes several cycles to produce a result, some intermediate results remain cached in the register after the last input data has been accumulated or compared; the remaining data in the register are therefore accumulated or compared one by one until the register is empty. In maximum mode, the output after the register is empty is the result of the global pooling operation; in average mode, the output after the register is empty is divided by the size of the input data, i.e. multiplied by the reciprocal (a floating-point number) of the input data size, to obtain the final result.
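The streaming accumulate-into-register scheme described above can be sketched as follows; for simplicity, the several partial results that the hardware drains at the end are collapsed into a single software accumulator, which is an assumption of this sketch.

```python
import numpy as np

def global_pool_stream(channel_data, mode='max'):
    """Streaming global pooling: column groups arrive one at a time, each group is
    reduced, and the running result is kept in a 'register' initialised to -inf
    (max mode) or 0 (average mode)."""
    register = np.float32(-np.inf) if mode == 'max' else np.float32(0.0)
    total = 0
    for column_group in channel_data:              # e.g. several columns per cycle
        partial = column_group.max() if mode == 'max' else column_group.sum()
        register = max(register, partial) if mode == 'max' else register + partial
        total += column_group.size
    if mode == 'max':
        return register
    return register * np.float32(1.0 / total)      # multiply by the reciprocal of the size

data = [np.array([1.0, 5.0, -3.0], dtype=np.float32),
        np.array([2.0, 0.0], dtype=np.float32)]
print(global_pool_stream(data, 'max'))   # 5.0
print(global_pool_stream(data, 'avg'))   # 1.0
```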
The pooling operator operation, also called downsampling, is used to reduce the dimensionality of data; it can be regarded as a blur filter that prevents the influence of overfitting and distortion and enhances the anti-distortion capability of the neural network. The pooling modes of the pooling operator include maximum pooling and average pooling. The pooling window size is 2×2 or 3×3 and the stride is 1 or 2; a general mode, a round-up mode and a round-down mode are supported, and the mode needs to be judged according to the sizes of the input and output data and the padding.
In one embodiment, the specific mode can be determined according to the widths of the input and output in a given direction and the padding. Specifically, the width Y1 that the source image would have in the general mode is calculated from the width Y0 of the target image; the mode of the pooling operator is then determined by comparing the actually input source-image width Y2 with the calculated width Y1. Y1 is calculated as follows:
Y1=Y0×pool_stride+(pool_size-pool_stride)-pad_left-pad_right。
where pool_stride denotes the stride, pool_size the pooling window size, and pad_left and pad_right the numbers of padding elements on the left and right, respectively. In the general mode, Y1 = Y2; in the round-up mode, Y1 > Y2; in the round-down mode, Y1 < Y2.
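A small worked sketch of this mode decision is given below; the window, stride and image widths used in the examples are made up for illustration.

```python
def pooling_mode(y_out, y_in, pool_size, pool_stride, pad_left=0, pad_right=0):
    """Decide the pooling rounding mode from the Y1 formula above."""
    y1 = y_out * pool_stride + (pool_size - pool_stride) - pad_left - pad_right
    if y1 == y_in:
        return 'general'
    return 'ceil' if y1 > y_in else 'floor'

# 7 -> 3 with a 3x3 window, stride 2, no padding: Y1 = 3*2 + 1 = 7 = Y2 -> general mode.
print(pooling_mode(y_out=3, y_in=7, pool_size=3, pool_stride=2))   # general
# 6 -> 3 with the same window: Y1 = 7 > 6 -> round-up (ceil) mode.
print(pooling_mode(y_out=3, y_in=6, pool_size=3, pool_stride=2))   # ceil
```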
The Resize operator is used to interpolate the input data; for example, the Resize operator interpolates the input feature map. In some embodiments, the Resize operator employs a bilinear interpolation algorithm, for example interpolating in the horizontal direction and then in the vertical direction, or in the vertical direction and then in the horizontal direction.
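The following Python sketch shows such a separable bilinear interpolation (one axis, then the other); the align-corners coordinate mapping used here is an assumption, since the text does not specify the mapping convention.

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    """Separable bilinear interpolation: interpolate along one axis, then the other.
    The align-corners coordinate mapping is an assumption of this sketch."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)

    # Horizontal pass: for every source row, interpolate to the new width.
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wx = xs - x0
    horiz = img[:, x0] * (1 - wx) + img[:, x1] * wx          # shape (in_h, out_w)

    # Vertical pass on the horizontally interpolated rows.
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    wy = (ys - y0)[:, None]
    return horiz[y0] * (1 - wy) + horiz[y1] * wy             # shape (out_h, out_w)

img = np.array([[0.0, 2.0], [4.0, 6.0]], dtype=np.float32)
print(resize_bilinear(img, 3, 3))
# [[0. 1. 2.]
#  [2. 3. 4.]
#  [4. 5. 6.]]
```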
In summary, the vector processing unit provided by the embodiments of the present application offers high bandwidth utilization and high data parallelism, can make full use of the computing power of the neural network processor to process vector data in a pipelined manner, and improves the acceleration performance.
The application also provides a depth camera, which includes the above neural network processor for processing the image data collected by the depth camera. The depth camera may also include other components that are not shown, such as a laser emitting module and an infrared imaging module.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (11)

1. A vector processing unit for a neural network processor, comprising:
the precision-raising module is used for raising the input first integer data to the first floating-point data;
the inverse quantization module is used for carrying out inverse quantization on the first floating point type data to obtain inverse quantization data, wherein the inverse quantization data is floating point type data;
the operation module is used for operating the inverse quantization data to obtain second floating point type data;
the quantization module is used for quantizing the second floating point type data to obtain quantized data, wherein the quantized data is floating point type data;
and the precision reducing module is used for reducing the precision of the quantized data to obtain second integer data.
2. The vector processing unit according to claim 1, wherein the operation module comprises a comparison operator, a multiplication operator and an addition operator, and the operation module implements at least one of an activation function operator operation, an Eltwise operator operation, a Resize operator operation, a pooling operator operation, a global pooling operator operation and a binocular operator operation through the comparison operator, the multiplication operator and the addition operator.
3. The vector processing unit of claim 2, wherein the activation functions in the activation function operators are derived by piecewise polynomial fitting.
4. The vector processing unit according to claim 3, wherein the activation function has an exponential shape, and the expression of the activation function is:
sigmoid(x) ≈ a·x + b,        x ≥ 0
sigmoid(x) ≈ a·x + (1 − b),  x < 0
wherein x is the input data of the activation function sigmoid(x), and a and b are obtained by looking up a table according to the segmentation interval in which x is located.
5. The vector processing unit according to claim 2, wherein the comparison operator, the multiplication operator and the addition operator used in the activation function operator operation, the Eltwise operator operation, the Resize operator operation, the pooling operator operation, the global pooling operator operation and the binocular operator operation are at least partially the same.
6. The vector processing unit according to claim 1, wherein, when the first integer data is not of a preset type, the precision-raising module is configured to bit-expand the first integer data and then raise its precision to obtain the first floating-point data.
7. The vector processing unit of any of claims 1 to 6, further comprising an on-chip memory module for storing the second integer data and writing the second integer data into an off-chip memory in a data arrangement format required by a next processing unit of the neural network processor connected to the vector processing unit.
8. The vector processing unit according to claim 7, wherein the on-chip storage module is configured to allocate read-address and write-address space according to the data type of the second integer data, the data arrangement format and data type required by the next processing unit, and the correspondingly selected working mode, so as to write the second integer data into the off-chip memory; the data arrangement format comprises NC4HW, NC8HW and NC16HW; and the working modes comprise a full output channel mode, a 2-fold resolution parallel mode and a 4-fold resolution parallel mode.
9. A neural network processor, characterized in that the neural network processor comprises the vector processing unit according to any one of claims 1 to 8, a first input buffer unit, a convolution unit, and a second input buffer unit, the convolution unit being configured to perform convolution processing on input data;
the first input buffer unit inputs the input data to the vector processing unit through a first path or a second path;
the first path is that the first input buffer unit transmits the buffered input data to the convolution unit, the convolution unit outputs the convolved data to the second input buffer unit, and the second input buffer unit inputs the data to the vector processing unit, wherein the first integer data is the data obtained after convolution by the convolution unit;
the second path is that the first input buffer unit directly transmits the input data to the vector processing unit, and the first integer data is the input data.
10. The neural network processor of claim 9,
under the first path, the inverse quantization parameter of the inverse quantization module is the product of the inverse quantization parameter required by the input data and the inverse quantization parameter of the convolution unit weight;
and under the second path, the inverse quantization parameter of the inverse quantization module is the inverse quantization parameter required by the input data.
11. A depth camera comprising the neural network processor of any one of claims 9-10.
CN202310598167.1A 2023-05-24 2023-05-24 Vector processing unit, neural network processor and depth camera Pending CN116611488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310598167.1A CN116611488A (en) 2023-05-24 2023-05-24 Vector processing unit, neural network processor and depth camera

Publications (1)

Publication Number Publication Date
CN116611488A true CN116611488A (en) 2023-08-18

Family

ID=87681430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310598167.1A Pending CN116611488A (en) 2023-05-24 2023-05-24 Vector processing unit, neural network processor and depth camera

Country Status (1)

Country Link
CN (1) CN116611488A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634577A (en) * 2024-01-25 2024-03-01 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment
CN117634577B (en) * 2024-01-25 2024-06-07 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination