WO2019084788A1 - Computation apparatus, circuit and relevant method for neural network - Google Patents


Info

Publication number
WO2019084788A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
input feature
window
processing
column
Prior art date
Application number
PCT/CN2017/108640
Other languages
French (fr)
Chinese (zh)
Inventor
谷骞
高明明
李涛
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to CN201780013527.XA priority Critical patent/CN108780524A/en
Priority to PCT/CN2017/108640 priority patent/WO2019084788A1/en
Publication of WO2019084788A1 publication Critical patent/WO2019084788A1/en
Priority to US16/727,677 priority patent/US20200134435A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of neural networks and, more particularly, to an arithmetic device, circuit and related method for a neural network.
  • the convolutional neural network is formed by stacking multiple layers.
  • the output of one layer consists of output feature maps (OFMs), which serve as the input feature maps of the next layer.
  • the output feature maps of the intermediate layers have many channels, and the images are relatively large.
  • an output feature map is therefore usually divided into multiple feature map segments; the segments are output sequentially, and each segment is output in parallel by columns. For example, a complete output feature map may be divided into three feature picture segments, each of which is output column by column in turn.
  • a line buffer is usually used to implement data input of a convolution layer operation or a pooling layer operation.
  • the structure of the line buffer requires the input data to arrive in raster order, row by row (or column by column). Taking the height of the pooling window as k and the width of the input feature matrix as W, the line buffer needs a depth of k*W; that is, it must buffer k*W input data before any operation on the data can begin, which increases the latency of data processing.
  • the existing image processing scheme therefore requires a large buffer space and incurs a large data processing delay.
  • the present application provides an arithmetic device, a circuit, and a related method for a neural network, which can effectively save buffer space and reduce the delay of data processing.
  • a first aspect provides an arithmetic device for a neural network, comprising: a first processing unit, configured to perform a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and a second processing unit, configured to perform a second arithmetic operation on k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
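The two-stage computation named in the first aspect can be illustrated with a minimal sketch; the pooling window, the max operation, and the data values below are assumptions for illustration, not the patented hardware:

```python
k1, k2 = 2, 2
window = [[7, 1], [3, 9]]  # one k1 x k2 calculation window (hypothetical data)
# first arithmetic operation: reduce the k1 values of each window column
intermediates = [max(window[r][c] for r in range(k1)) for c in range(k2)]
# second arithmetic operation: reduce the k2 intermediate results
result = max(intermediates)
assert intermediates == [7, 9] and result == 9
```

Because the first operation consumes one column at a time, the second stage can run as soon as k2 column results exist, rather than after a full k1 × W buffer fills.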
  • the computing device includes M first processing units and M second processing units, the M first processing units corresponding one-to-one to the M second processing units, M being a positive integer greater than 1; the computing device further includes a pre-processing unit, configured to receive the input feature matrix by column and to process the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values; the pre-processing unit is further configured to input the M sets of data one-to-one into the M first processing units.
  • a second aspect provides a circuit for processing a neural network, comprising: a first processing circuit, configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window, to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and a second processing circuit, configured to perform a second arithmetic operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • the circuit includes M first processing circuits and M second processing circuits, the M first processing circuits corresponding one-to-one to the M second processing circuits, M being a positive integer greater than 1; the circuit further includes a pre-processing circuit for receiving the input feature matrix by column and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values; the pre-processing circuit is further configured to input the M sets of data one-to-one into the M first processing circuits.
  • a third aspect provides a method for processing a neural network, the method comprising: performing a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and performing a second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window, to obtain a calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • the method further includes: receiving the input feature matrix by column, and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values.
  • performing the first arithmetic operation on the k1 input feature data according to the size of the calculation window to obtain the intermediate result then includes: performing the first arithmetic operation on each of the M sets of data according to the size of the calculation window to obtain a corresponding intermediate result; performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain the calculation result includes: for every k2 intermediate results obtained by the first arithmetic operation corresponding to each of the M sets of data, performing the second arithmetic operation to obtain a corresponding calculation result.
  • a fourth aspect provides a computer readable storage medium having stored thereon a computer program which, when executed by a computer, implements: performing a first arithmetic operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers; and performing a second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window, to obtain a calculation result.
  • when the computer program is executed by the computer, it further implements: receiving the input feature matrix by column, and processing the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values.
  • performing the first arithmetic operation on the k1 input feature data according to the size of the calculation window to obtain the intermediate result includes: performing the first arithmetic operation on each of the M sets of data according to the size of the calculation window to obtain a corresponding intermediate result; performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain the calculation result includes: for every k2 intermediate results obtained by the first arithmetic operation corresponding to each of the M sets of data, performing the second arithmetic operation to obtain a corresponding calculation result.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; in other words, the input feature matrix can be buffered by row or by column while the operations are performed at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can begin. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • Figure 1 is a schematic diagram of a neural network convolutional layer operation.
  • FIG. 2 is a schematic diagram of a neural network pooling layer operation.
  • FIG. 3 is a schematic block diagram of an operation apparatus for a neural network according to an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of an operation apparatus for a neural network according to another embodiment of the present application.
  • FIG. 5 is a schematic block diagram of an operation device for a neural network according to still another embodiment of the present application.
  • FIG. 6 is a schematic diagram of a circuit for processing a neural network according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a circuit for a neural network according to another embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for processing a neural network according to an embodiment of the present application.
  • the following describes the convolution layer operation and the pooling layer operation in the convolutional neural network.
  • the operation of the convolution layer operation is to slide a fixed-size window across the entire image plane, and multiply and accumulate the data covered in the window at each moment.
  • the window slides in steps of 1.
  • Figure 1 is a schematic diagram of a convolution layer operation.
  • in FIG. 1, the input image has a height H1 of 3 and a width W1 of 4; the convolution window has a height k1 of 2 and a width k2 of 2.
  • the convolution layer operation slides a 2 × 2 convolution window over the 3 × 4 image with a step size of 1; the 4 data covered by the convolution window at each position are multiplied and accumulated to obtain one output result, and all the output results constitute the output image.
  • the output image has a height H2 of 2 and a width W2 of 3.
  • the operation mode of the operator op is multiply-accumulate.
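The FIG. 1 example can be reproduced with a short sketch; the `conv2d` helper and the input values are hypothetical, chosen only to match the sizes above:

```python
import numpy as np

def conv2d(image, kernel, step=1):
    """Slide a k1 x k2 window over the image, multiply-accumulating the
    covered data at each position (the plain convolution layer operation)."""
    k1, k2 = kernel.shape
    h2 = (image.shape[0] - k1) // step + 1   # output height H2
    w2 = (image.shape[1] - k2) // step + 1   # output width W2
    out = np.empty((h2, w2))
    for i in range(h2):
        for j in range(w2):
            window = image[i * step:i * step + k1, j * step:j * step + k2]
            out[i, j] = np.sum(window * kernel)   # multiply-accumulate
    return out

image = np.arange(12, dtype=float).reshape(3, 4)   # H1 = 3, W1 = 4
kernel = np.ones((2, 2))                           # k1 = 2, k2 = 2
result = conv2d(image, kernel)
assert result.shape == (2, 3)                      # H2 = 2, W2 = 3, as in FIG. 1
```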
  • the pooling layer operation slides a fixed-size window across the entire image plane and reduces the data covered by the window at each moment to a maximum value or an average value as the output.
  • the step size of the window slide is equal to the height (or width) of the window.
  • FIG. 2 is a schematic diagram of the pooling layer operation.
  • in FIG. 2, the input image has a height H1 of 6 and a width W1 of 8; the pooling window has a height k1 of 2 and a width k2 of 2.
  • the pooling layer operation slides a 2 × 2 pooling window over the 6 × 8 image with a step size of 2; the 4 data covered by the window at each position yield one output result, and all the output results constitute the output image.
  • the output image has a height H2 of 3 and a width W2 of 4.
  • the operation mode of the operator op is to take the maximum value (max) or the average value (avg).
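The output sizes in both figures follow from the usual sliding-window formula; the `out_size` helper below is a hypothetical illustration:

```python
def out_size(in_size, window, step):
    """Number of window positions along one dimension."""
    return (in_size - window) // step + 1

# FIG. 1: 3x4 image, 2x2 convolution window, step 1 -> 2x3 output
assert (out_size(3, 2, 1), out_size(4, 2, 1)) == (2, 3)
# FIG. 2: 6x8 image, 2x2 pooling window, step 2 -> 3x4 output
assert (out_size(6, 2, 2), out_size(8, 2, 2)) == (3, 4)
```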
  • in the present application, the process of "first taking a window and then performing the calculation" is decomposed into a column operation and a row operation.
  • in one case, the process of "first taking a window and then performing the calculation" is decomposed into a column operation performed first, followed by a row operation.
  • specifically, the data of the same column in the window is calculated first to obtain an intermediate result; then the intermediate results of all the columns in the window are calculated to obtain a calculation result.
  • in the other case, the process of "first taking a window and then performing the calculation" is decomposed into a row operation performed first, followed by a column operation.
  • specifically, the data of the same row in the window is calculated first to obtain an intermediate result; then the intermediate results of all the rows in the window are calculated to obtain a calculation result.
  • the data cache mode can be flexibly set according to the input mode of the input data. For example, if the input data is input by column, it is cached by column, and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row, and the cached data undergoes the row operation first and then the column operation.
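As a software illustration (not the patented circuit) that this decomposition is exact for a convolution: each window column is multiply-accumulated against the matching kernel column first, and the k2 per-column intermediate results are then accumulated across columns. The `conv2d_decomposed` helper and its data are assumptions for the sketch:

```python
import numpy as np

def conv2d_decomposed(image, kernel):
    """Stride-1 convolution computed as column operations then a row operation."""
    k1, k2 = kernel.shape
    h1, w1 = image.shape
    # column operation: MAC k1 vertical neighbours of every image column
    # against each kernel column, giving intermediate results
    inter = np.empty((h1 - k1 + 1, w1, k2))
    for j in range(w1):
        for i in range(h1 - k1 + 1):
            for c in range(k2):
                inter[i, j, c] = np.dot(image[i:i + k1, j], kernel[:, c])
    # row operation: accumulate k2 intermediate results across columns
    out = np.empty((h1 - k1 + 1, w1 - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = sum(inter[i, j + c, c] for c in range(k2))
    return out

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1., 2.], [3., 4.]])
# the decomposition must agree with the direct windowed multiply-accumulate
direct = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel)
                    for j in range(3)] for i in range(2)])
assert np.allclose(conv2d_decomposed(image, kernel), direct)
```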
  • FIG. 3 is a schematic block diagram of an arithmetic device 300 for a neural network provided by the present application.
  • the computing device 300 includes:
  • the first processing unit 310 is configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, the size of the calculation window being k1 × k2, where k1 and k2 are both positive integers.
  • the second processing unit 320 is configured to perform a second arithmetic operation on the k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  • in one case, the first processing unit 310 is configured to perform the first arithmetic operation on k1 input feature data in a column of input feature values in the input feature matrix, where k1 represents the height of the calculation window and k2 represents its width; the second processing unit 320 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing unit, which is equivalent to performing the second arithmetic operation on the intermediate results of different columns, to obtain a calculation result.
  • in this case, the first processing unit 310 may be referred to as a column processing unit and the first arithmetic operation as a column operation; the second processing unit 320 may be referred to as a row processing unit and the second arithmetic operation as a row operation.
  • in the other case, the first processing unit 310 is configured to perform the first arithmetic operation on k1 input feature data in a row of input feature values in the input feature matrix, where k1 represents the width of the calculation window and k2 represents its height; the second processing unit 320 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing unit, which is equivalent to performing the second arithmetic operation on the intermediate results of different rows, to obtain a calculation result.
  • in this case, the first processing unit 310 may be referred to as a row processing unit and the first arithmetic operation as a row operation; the second processing unit 320 may be referred to as a column processing unit and the second arithmetic operation as a column operation.
  • therefore, the calculation can be started as soon as one row or one column of input data is received; the input feature matrix can be buffered by row or by column while the operations are performed simultaneously, without first caching a certain amount of two-dimensional input data as in the prior art before the calculation can be performed. The delay of data processing can thus be effectively reduced, the data processing efficiency of the neural network effectively improved, and storage resources, and hence hardware resources, saved.
  • when the calculation window is a convolution window, the operation mode of the first arithmetic operation is multiply-accumulate, and the operation mode of the second arithmetic operation is accumulation.
  • this embodiment can improve the convolution operation efficiency of the neural network.
  • when the calculation window is a pooling window, the operation mode of the first arithmetic operation is taking a maximum value or an average value, and the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • this embodiment can improve the pooling operation efficiency of the neural network.
  • the computing device includes M first processing units 310 and M second processing units 320, the M first processing units 310 corresponding one-to-one to the M second processing units 320, M being a positive integer greater than 1.
  • the computing device 300 further includes:
  • the pre-processing unit 330, configured to receive the input feature matrix by column and to process the received column of input feature values according to the calculation window to obtain M sets of data, each set of data including k1 input feature values;
  • the pre-processing unit is further configured to input the M sets of data one-to-one into the M first processing units.
  • the pre-processing unit 330 receives the first column of input feature values in the input feature matrix, processes it into M sets of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 output M intermediate results, which are input one-to-one into the M second processing units 320.
  • the pre-processing unit 330 then receives the second column of input feature values in the input feature matrix, processes it into M sets of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 again output M intermediate results, which are input one-to-one into the M second processing units 320.
  • when the pre-processing unit 330 receives the k2-th column of input feature values, it likewise processes it into M sets of data, inputs them into the M first processing units 310 for column processing, and the resulting M intermediate results are input one-to-one into the M second processing units 320. At this point, each of the M second processing units 320 has received k2 intermediate results, so each second processing unit 320 performs a row operation on its k2 intermediate results to obtain an operation result; that is, the M second processing units 320 obtain M operation results. Subsequently, the pre-processing unit 330 may continue to receive column input feature values and repeat the above process to obtain the next M operation results, which is not described here again.
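The dataflow just described can be sketched in software; max pooling with a k1 × k2 window, matching strides, and the sizes of FIG. 2 are assumed for illustration:

```python
import numpy as np

H, W, k1, k2 = 6, 8, 2, 2
M = H // k1                              # one unit pair per vertical window
matrix = np.arange(H * W, dtype=float).reshape(H, W)

pending = [[] for _ in range(M)]         # intermediates held by each second unit
output_columns = []
for column in matrix.T:                  # the input feature matrix arrives by column
    groups = column.reshape(M, k1)       # pre-processing: M groups of k1 values
    for m in range(M):
        pending[m].append(groups[m].max())          # first unit m: column operation
    if len(pending[0]) == k2:            # after k2 columns the second units fire
        output_columns.append([max(p) for p in pending])  # row operation
        pending = [[] for _ in range(M)]

result = np.array(output_columns).T      # shape (M, W // k2)
assert np.array_equal(result, matrix.reshape(M, k1, W // k2, k2).max(axis=(1, 3)))
```

Each incoming column immediately produces M intermediate results, and a full column of M outputs appears every k2 input columns, without a k1 × W line buffer.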
  • as noted above, an output feature map is generally divided into a plurality of feature picture segments; the feature picture segments are output sequentially, and each feature picture segment is output in parallel by columns.
  • for example, a complete output feature map is divided into three feature picture segments, and each feature picture segment is output column by column in turn.
  • in contrast, in the prior art the line buffer requires input by rows, while the data of the feature picture segments is input by columns; the data corresponding to a feature picture segment is input in parallel, while the line buffer processes data serially. The input and output rates are therefore mismatched and the data throughput is too low, which becomes the bottleneck of the accelerator and reduces its speed.
  • in the present application, the pre-processing unit 330 receives a feature picture segment by columns, the M first processing units perform column operations on the column input feature values of the feature picture segment, and the M second processing units perform row operations on the intermediate results output by the M first processing units, thereby obtaining the calculation result of the feature picture segment, that is, the neural network processing result of the feature picture segment.
  • the data cache mode can be flexibly set according to the input mode of the input data: if the input data is input by column, it is cached by column and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row and the cached data undergoes the row operation first and then the column operation, thereby improving the data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the number M of first processing units 310 and second processing units 320 included in the computing device 300 is determined by the size of the input feature matrix and the size of the calculation window.
  • take the case where the first processing unit 310 performs column processing and the second processing unit 320 performs row processing as an example: for a convolution operation, if the height of the input feature matrix is H (H being an integer greater than or equal to k1) and the convolution window has a height of k1 and a width of k2, then M = H - (k1 - 1).
  • in this case, the M sets of data include all the data in the column of input feature values; that is, the computing device 300 provided by the present application can implement parallel processing of an entire column of input feature values.
  • again take the case where the first processing unit 310 performs column processing and the second processing unit 320 performs row processing as an example: for a pooling operation, if the height of the input feature matrix is H (H being an integer greater than or equal to k1) and the pooling window has a height of k1 and a width of k2, then M = H / k1 when H is evenly divisible by k1.
  • in this case, the M sets of data include all the data in the column of input feature values; that is, the arithmetic device 300 provided by the present application can implement parallel processing of an entire column of input feature values.
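A quick check of the two unit counts, using the sizes of FIG. 1 and FIG. 2 (the helper names are hypothetical):

```python
def m_convolution(H, k1):
    """Stride-1 convolution: the vertical windows overlap."""
    return H - (k1 - 1)

def m_pooling(H, k1):
    """Pooling with stride k1: the vertical windows tile; any remainder
    rows are held in the pre-processing unit's buffer."""
    return H // k1

assert m_convolution(3, 2) == 2   # the FIG. 1 image: H1 = 3, k1 = 2
assert m_pooling(6, 2) == 3       # the FIG. 2 image: H1 = 6, k1 = 2
```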
  • in other embodiments, the M sets of data are only part of the column of input feature values; the pre-processing unit 330 further includes a buffer, and the pre-processing unit 330 is further configured to store the remaining data other than the M sets of data in the buffer.
  • for example, the data of the last few rows of the input feature matrix needs to be cached in the buffer first and processed separately later.
  • as described above, each feature picture segment is output in parallel by columns.
  • suppose an output feature map is divided into two feature picture segments.
  • if the height of the first feature picture segment is not evenly divisible by the height k1 of the pooling window, the last few rows of the first feature picture segment are first cached in the buffer; when the input of the second feature picture segment becomes valid, the cached data is read from the buffer and spliced with the current data (the data of the second feature picture segment) into a new feature picture segment, which is then remapped into the M first processing units 310 for processing.
  • FIG. 5 is a schematic block diagram of an arithmetic device 500 for a neural network provided by the present application.
  • the computing device 500 includes a pre-processing unit 510, M column processing units 520, and M row processing units 530, the M column processing units 520 being in one-to-one correspondence with the M row processing units 530.
  • the pre-processing unit 510 is configured to receive input data and pre-process the input data according to the calculation window to obtain M sets of data, each set of data including k1 input feature values, and to input the M sets of data one-to-one into the M column processing units; the calculation window has a height of k1 and a width of k2.
  • receiving the input data specifically includes: the pre-processing unit 510 receives the input feature matrix by columns.
  • the column processing unit 520 is configured to perform a column operation on the input k1 input feature values to obtain an intermediate result, and to input the intermediate result into the corresponding row processing unit 530.
  • for a pooling operation, the column operation refers to taking the maximum value or the average value.
  • for a convolution operation, the column operation refers to a multiply-accumulate operation.
  • the row processing unit 530 is configured to buffer the intermediate results output by the corresponding column processing unit 520, and, once k2 intermediate results have been received, to perform a row operation on them to obtain a calculation result.
  • for a pooling operation, the operation mode of the row operation is the same as that of the column operation.
  • for a convolution operation, the row operation refers to an accumulation operation.
  • the calculation results of the M line processing units 530 constitute the output data of the arithmetic unit 500.
  • the input data received by the pre-processing unit 510 is a feature picture segment obtained by dividing the input feature map to be processed.
  • the number M of column processing units 520 and row processing units 530 is determined by the size of the input feature matrix received by the pre-processing unit 510 and the size of the computation window.
  • the input feature matrix is a feature picture segment.
  • the pre-processing unit 510 is configured to sequentially receive the plurality of feature picture segments.
  • the sliding window (that is, the calculation window) may cover part of the data of the upper and lower feature picture segments at the same time.
  • in this case, the pre-processing unit 510 is configured to cache the last few rows of data of the previous feature picture segment covered by the window in the buffer of the pre-processing unit 510 (as shown in FIG. 5), and to wait for the next feature picture segment.
  • when the next feature picture segment arrives, the cached data is read out from the buffer and spliced with the current data (that is, the currently input feature picture segment) into a new feature picture segment, and the new data is then remapped into the M column processing units 520.
  • in this way, the cache space can be effectively saved, thereby saving hardware resources.
  • for example, as shown in FIG. 5, each column processing unit 520 can process 2 rows of data of the same column, and each row processing unit 530 can process 2 columns of data of the same row; in this case only three column processing units 520 and three row processing units 530 need to be provided in the arithmetic device.
  • suppose the input feature map is divided into two feature picture segments, segment1 and segment2, each of height h = 14, and the pooling window size is 3 × 3 with a step size of 2.
  • when processing segment1, the pre-processing unit 510 needs to cache the last two rows of segment1 into the buffer; after receiving segment2, it splices these rows with segment2 into a new feature picture segment of height 16, which is then remapped into the column processing units 520.
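The splicing step of this example can be sketched as follows; the `split_leftover` helper and the segment width are hypothetical illustrations of which rows are carried over:

```python
import numpy as np

K1, STEP = 3, 2                     # pooling window height and step, as in the example

def split_leftover(segment):
    """Rows of a segment not yet consumed by the sliding window; they are
    spliced onto the start of the next segment."""
    h = segment.shape[0]
    next_start = ((h - K1) // STEP + 1) * STEP   # first window needing more rows
    return segment[next_start:]

segment1 = np.zeros((14, 8))        # h = 14 (the width 8 is an arbitrary assumption)
segment2 = np.ones((14, 8))

carry = split_leftover(segment1)         # the last two rows of segment1
spliced = np.vstack([carry, segment2])   # new feature picture segment of height 16
assert carry.shape[0] == 2 and spliced.shape[0] == 16
```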
  • in summary, the present application decomposes the window operations of the neural network into column operations and row operations, so that the calculation can be started as soon as one row or one column of input data is received, without having to cache a certain amount of two-dimensional input data as in the prior art before the calculation can be performed; the delay of data processing can thus be effectively reduced, so that real-time data processing can be realized.
  • the data cache mode can be flexibly set according to the input mode of the input data. For example, if the input data is input by column, it is cached by column, and the cached data undergoes the column operation first and then the row operation; if the input data is input by row, it is cached by row, and the cached data undergoes the row operation first and then the column operation.
  • the computing device provided by the present application requires less buffer space than the prior art, thereby saving hardware overhead.
  • the computing device provided by some embodiments can implement parallel processing of multiple computing windows, thereby improving the data throughput rate and overcoming the bottleneck of the neural network accelerator.
  • the embodiment of the present application further provides a circuit 600 for processing a neural network.
  • the circuit 600 may correspond to the arithmetic device 300 or 500 provided by the above embodiment.
  • the circuit 600 includes:
  • the first processing circuit 610 is configured to perform a first arithmetic operation on k1 input feature data according to the size of the calculation window, to obtain an intermediate result, where the size of the calculation window is k1×k2, and both k1 and k2 are positive integers;
  • the second processing circuit 620 is configured to perform a second arithmetic operation on the k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  • the first processing circuit 610 is configured to perform the first arithmetic operation on k1 input feature data of a column of input feature values in the input feature matrix, where k1 represents the height of the calculation window and k2 represents the width of the calculation window.
  • the second processing circuit 620 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing circuit 610, which is equivalent to performing the second arithmetic operation on intermediate results from different columns, to obtain a calculation result.
  • the first processing circuit 610 may be referred to as a column processing circuit, and correspondingly, the first arithmetic operation is referred to as a column operation;
  • the second processing circuit 620 may be referred to as a row processing circuit, and correspondingly, the second arithmetic operation is referred to as a row operation.
  • the first processing circuit 610 is configured to perform the first arithmetic operation on k1 input feature data of a row of input feature values in the input feature matrix, where k1 represents the width of the calculation window and k2 represents the height of the calculation window.
  • the second processing circuit 620 is configured to perform the second arithmetic operation on the k2 intermediate results output by the first processing circuit, which is equivalent to performing the second arithmetic operation on intermediate results from different rows, to obtain a calculation result.
  • the first processing circuit 610 may be referred to as a row processing circuit, and correspondingly, the first operation operation is referred to as a row operation;
  • the second processing circuit 620 may be referred to as a column processing circuit, and correspondingly, the second arithmetic operation is referred to as a column operation.
  • the calculation can be started as long as one row or one column of input data is received; in other words, the input feature matrix can be cached by rows or by columns while operations are performed at the same time, without first caching a block of two-dimensional input data. This effectively reduces the latency of data processing, improves the data processing efficiency of the neural network, and also saves storage resources, thereby saving hardware resources.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
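For the convolution case, a hedged sketch (stride 1, no padding, correlation-style indexing; the helper names are illustrative and not from the patent): the first operation multiply-accumulates one kernel column against a k1-tall column slice, and the second operation accumulates the k2 per-column intermediates.

```python
def conv_decomposed(matrix, kernel):
    """k1 x k2 convolution (stride 1, no padding) decomposed into a
    column multiply-accumulate followed by a row accumulation."""
    k1, k2 = len(kernel), len(kernel[0])
    h, w = len(matrix), len(matrix[0])

    def col_mac(i, j, kcol):
        # First operation: MAC of one kernel column with a column slice.
        return sum(matrix[i + d][j] * kernel[d][kcol] for d in range(k1))

    # Second operation: accumulate one intermediate per kernel column.
    return [[sum(col_mac(i, j + kcol, kcol) for kcol in range(k2))
             for j in range(w - k2 + 1)]
            for i in range(h - k1 + 1)]

print(conv_decomposed([[1, 2], [3, 4]], [[1, 0], [0, 1]]))
# [[5]]
```

For pooling, the same structure applies with both operations replaced by max (or mean), matching the bullet above that the second operation uses the same operator as the first.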
  • the circuit 600 includes M first processing circuits 610 and M second processing circuits 620, the M first processing circuits 610 corresponding one-to-one with the M second processing circuits 620, and M is a positive integer greater than 1.
  • the circuit 600 further includes: a pre-processing circuit 630, configured to receive the input feature matrix by columns and process the received column of input feature values according to the calculation window to obtain M groups of data, wherein each group of data includes k1 input feature values; the pre-processing circuit 630 is further configured to input the M groups of data one-to-one into the M first processing circuits 610.
  • the pre-processing circuit 630 receives the first column of input feature values in the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results and input them one-to-one into the M second processing circuits 620.
  • the pre-processing circuit 630 receives the second column of input feature values in the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620.
  • when the pre-processing circuit 630 receives the k2-th column of input feature values, it processes it into M groups of data and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620. At this point, each of the M second processing circuits 620 has received k2 intermediate results, so each second processing circuit 620 performs a row operation on its k2 received intermediate results to obtain an operation result; that is, the M second processing circuits 620 obtain M operation results. Subsequently, the pre-processing circuit 630 can continue to receive column input feature values and repeat the above process to obtain the next M operation results, which is not described again here.
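The column-by-column pipeline just described can be modeled in a few lines of Python (a behavioral sketch only, with stride 1 assumed and max pooling as the example operation; the M circuits run in parallel in hardware but are looped over here, and all names are invented):

```python
from collections import deque

def stream_columns(columns, k1, k2, op=max):
    """Behavioral model of M first circuits feeding M second circuits.
    Each arriving column yields M = H - k1 + 1 intermediates (column
    operation); once k2 columns have arrived, M results are produced
    per column (row operation), as in the flow described above."""
    H = len(columns[0])
    M = H - k1 + 1
    history = [deque(maxlen=k2) for _ in range(M)]  # per-circuit window
    results = []
    for col in columns:
        for m in range(M):                          # M first circuits
            history[m].append(op(col[m:m + k1]))
        if len(history[0]) == k2:                   # M second circuits
            results.append([op(h) for h in history])
    return results

# Columns of a 3x3 input, 2x2 max-pooling window:
print(stream_columns([[1, 4, 7], [2, 5, 8], [3, 6, 9]], 2, 2))
# [[5, 8], [6, 9]]
```

The first output batch appears as soon as the k2-th column arrives, which is the latency advantage claimed over a full line buffer.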
  • the pre-processing circuit 630 receives a feature map segment by columns; the M first processing circuits perform column operations on the column input feature values of the feature map segment, and the M second processing circuits perform row operations on the intermediate results output by the M first processing circuits, thereby obtaining the calculation results of the feature map segment, that is, the neural network processing results of the feature map segment.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation, thereby improving data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the number M of the first processing circuit 610 and the second processing circuit 620 included in the computing device 300 is determined according to the size of the input feature matrix and the size of the calculation window.
  • take as an example the case where the first processing circuit 610 performs column processing and the second processing circuit 620 performs row processing.
  • H is an integer greater than or equal to k1
  • the height of the convolution window is k1
  • the width is k2
  • M = H - (k1 - 1).
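A quick worked check of M = H - (k1 - 1) (the helper name is hypothetical): it counts the vertical positions a window of height k1 can occupy in a column of height H at stride 1.

```python
def num_window_positions(H, k1):
    """M = H - (k1 - 1): vertical window positions at stride 1."""
    assert H >= k1
    return H - (k1 - 1)

print(num_window_positions(16, 3))  # 14
```

For example, a spliced segment of height 16 with a window height of 3 needs 14 parallel column processing circuits to cover a whole column at once.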
  • the M group data includes all the data in the column input feature values, that is, the computing device 300 provided by the present application can implement parallel processing of a column of input feature values.
  • take as an example the case where the first processing circuit 610 performs column processing and the second processing circuit 620 performs row processing.
  • H is an integer greater than or equal to k1
  • the height of the convolution window is k1
  • the width is k2
  • M = H / k1.
  • the M group data includes all data in the column input feature values, that is, the arithmetic device 300 provided by the present application can implement parallel processing of a column of input feature values.
  • the M sets of data include all of the data in the column of input feature values.
  • the M groups of data are part of the column of input feature values; the pre-processing circuit 630 further includes a buffer; the pre-processing circuit 630 is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  • the M group data is part of the input feature values of the column.
  • the data of the last few rows of the input feature matrix needs to be first buffered in the buffer, and then processed separately.
  • each feature map segment is output in parallel by columns.
  • an output feature map is divided into two feature map segments.
  • the height of the first feature map segment is not divisible by the height k1 of the pooling window, so the last few rows of the first feature map segment are first buffered in the buffer; when the input of the second feature map segment becomes valid, the buffered data is read from the buffer and spliced with the current data (the data of the second feature map segment) into a new feature map segment, which is re-mapped into the M first processing circuits 610 for processing.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation, thereby improving data throughput.
  • the arithmetic device provided by the embodiment can implement parallel processing of image data, thereby effectively improving the efficiency of data processing.
  • the input feature matrix represents a feature map segment of the image to be processed; the pre-processing circuit 630 is specifically configured to sequentially receive the feature map segments of the image to be processed.
  • the circuit 600 further includes a communication interface for receiving image data to be processed, and for outputting a calculation result of the second processing circuit, that is, outputting image data.
  • in the technical solution provided by the present application, by decomposing the window operations of the neural network into column operations and row operations, computation can start as soon as one row or one column of input data is received, without first caching a block of two-dimensional input data as in the prior art; therefore, the latency of data processing can be effectively reduced, so that real-time data processing can be realized.
  • the data cache mode can be set flexibly according to the input mode of the input data: if the input data is input by columns, it is cached by columns, and the cached data undergoes the column operation first and then the row operation; if the input data is input by rows, it is cached by rows, and the cached data undergoes the row operation first and then the column operation.
  • the computing device provided by the present application requires less buffer space than the prior art, thereby saving hardware overhead.
  • the computing device provided by some embodiments can implement parallel processing of multiple computing windows, thereby improving the data throughput rate and overcoming the bottleneck of the neural network accelerator.
  • the embodiment of the present application further provides a method 800 for processing a neural network.
  • the method 800 can be performed by the operation device provided in the foregoing embodiment, and the description of the technical solutions and the technical effects in the foregoing embodiments can be applied to the embodiment. For brevity, the description is not repeated herein.
  • the method 800 includes the following steps.
  • the size of the calculation window is k1×k2, and both k1 and k2 are positive integers.
  • step 810 can be performed by the first processing unit 310 in the above embodiment.
  • step 820 can be performed by the second processing unit 320 in the above embodiment.
  • the calculation can be started as long as one row or one column of input data is received; in other words, the input feature matrix can be cached by rows or by columns while operations are performed at the same time, without first caching a block of two-dimensional input data. This effectively reduces the latency of data processing, improves the data processing efficiency of the neural network, and also saves storage resources, thereby saving hardware resources.
  • the method 800 further includes: receiving an input feature matrix by columns, and processing the received column input feature values according to the calculation window to obtain M sets of data, wherein each set of data includes k1 Input feature values.
  • step 810 specifically includes: performing the first arithmetic operation on each of the M groups of data according to the size of the calculation window, to obtain the corresponding intermediate results.
  • the first arithmetic operations may be performed on the M groups of data by the M first processing units 310 in the foregoing embodiment, to obtain the corresponding intermediate results.
  • step 820 specifically includes: for the first arithmetic operation corresponding to each group of data in the M groups of data, each time k2 intermediate results are obtained, performing the second arithmetic operation to obtain a corresponding calculation result.
  • the second arithmetic operations may be performed on the intermediate results by the M second processing units 320 in the foregoing embodiment, to obtain the corresponding calculation results.
  • the value of M is related to the size of the input feature matrix and the size of the calculation window.
  • the M group data includes all data in the column input feature values.
  • the M groups of data are part of the column of input feature values; the method 800 further includes: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • the input feature matrix represents a feature map segment of the image to be processed; receiving the input feature matrix by columns includes: sequentially receiving each feature map segment of the image to be processed.
  • the embodiment of the present application further provides a computer readable storage medium storing a computer program; when the computer program is executed by a computer, a first arithmetic operation is performed on k1 input feature data according to the size of the calculation window to obtain intermediate results, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and according to the size of the calculation window, a second arithmetic operation is performed on the k2 intermediate results obtained by the first arithmetic operation, to obtain a calculation result.
  • when the computer program is executed by the computer, the following is further implemented: receiving the input feature matrix by columns, and processing the received column of input feature values according to the calculation window to obtain M groups of data, wherein each group of data includes k1 input feature values. Performing the first arithmetic operation on k1 input feature data according to the size of the calculation window to obtain intermediate results includes: performing the first arithmetic operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. Performing the second arithmetic operation on the k2 intermediate results obtained by the first arithmetic operation according to the size of the calculation window to obtain a calculation result includes: for the first arithmetic operation corresponding to each group of data in the M groups of data, each time k2 intermediate results are obtained, performing the second arithmetic operation to obtain a corresponding calculation result.
  • the value of M is related to the size of the input feature matrix and the size of the calculation window.
  • the M group data includes all data in the column input feature values.
  • the M groups of data are part of the column of input feature values; when the computer program is executed by the computer, the following is further implemented: storing the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  • the calculation window is a convolution window
  • the operation mode of the first arithmetic operation is a multiply-accumulate operation
  • the operation mode of the second arithmetic operation is an accumulation operation
  • the calculation window is a pooling window
  • the operation mode of the first arithmetic operation is to obtain a maximum value or an average value
  • the operation mode of the second arithmetic operation is the same as that of the first arithmetic operation.
  • the input feature matrix represents a feature map segment of the image to be processed; when the computer program is executed by the computer, receiving the input feature matrix by columns includes: sequentially receiving each feature map segment of the image to be processed.
  • the present application is applicable to a convolutional neural network (CNN) hardware accelerator, applied in the form of an IP core, and may also be applied to other types of neural network accelerators/processors that include a pooling layer.
  • CNN convolutional neural network
  • in the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, the implementation may be in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are generated in whole or in part.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or another programmable device.
  • the computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions can be from a website site, computer, server or data center Transmission to another website site, computer, server or data center via wired (eg coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (eg infrared, wireless, microwave, etc.).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (such as a digital video disc (DVD)), or a semiconductor medium (such as a solid state disk (SSD)).
  • the disclosed apparatus may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of units is only a logical function division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

Provided are a computation apparatus, circuit, and relevant method for a neural network. The computation apparatus comprises: a first processing unit, which is used for performing a first computation operation on k1 pieces of input feature data according to the size of a calculation window, to obtain intermediate results, wherein the size of the calculation window is k1×k2, k1 and k2 being both positive integers; and a second processing unit, which is used for performing, according to the size of the calculation window, a second computation operation on k2 intermediate results output by the first processing unit, to obtain a calculation result. The computation apparatus can effectively save on a cache space, so as to save on hardware resources, and can also reduce the delay of data processing.

Description

用于神经网络的运算装置、电路及相关方法Arithmetic device, circuit and related method for neural network
版权申明Copyright statement
本专利文件披露的内容包含受版权保护的材料。该版权为版权所有人所有。版权所有人不反对任何人复制专利与商标局的官方记录和档案中所存在的该专利文件或者该专利披露。The disclosure of this patent document contains material that is subject to copyright protection. This copyright is the property of the copyright holder. The copyright owner has no objection to the reproduction of the patent document or the patent disclosure in the official records and files of the Patent and Trademark Office.
技术领域Technical field
本申请涉及神经网络领域,并且更为具体地,涉及一种用于神经网络的运算装置、电路及相关方法。The present application relates to the field of neural networks and, more particularly, to an arithmetic device, circuit and related method for a neural network.
背景技术Background technique
卷积神经网络由多个层叠加在一起形成,上一层的结果为输出特征图(output feature maps,OFMs),作为下一层的输入特征图。通常中间层的输出特征图的通道非常多,图像也比较大。卷积神经网络的硬件加速器在处理特征图数据时,由于片上系统缓存大小和带宽的限制,通常将一张输出特征图分割成多个特征图片段(feature map segment),依次输出每个特征图片段,并且每个特征图片段按列并行输出。例如,一个完整的输出特征图被分割成3个特征图片段,每个特征图片段按列依次输出。The convolutional neural network is formed by superimposing multiple layers. The result of the upper layer is output feature maps (OFMs), which is used as the input feature map of the next layer. Usually, the output feature map of the middle layer has many channels and the image is relatively large. When the hardware accelerator of the convolutional neural network processes the feature map data, due to the limitation of the on-chip system cache size and bandwidth, an output feature map is usually divided into multiple feature map segments, and each feature image is sequentially output. Segments, and each feature picture segment is output in parallel in columns. For example, a complete output feature map is divided into three feature image segments, and each feature image segment is sequentially output in columns.
目前,在图像处理过程中,通常使用线缓冲器(line buffer)来实现卷积层运算或池化层运算的数据输入。线缓冲器的结构要求输入数据按照行(或列)优先以光栅化的顺序输入。以池化窗口的高度为k,输入特征矩阵的宽度为W为例,则线缓冲器需要缓存的深度k*W,即线缓冲器必须缓存大小为k*W的输入数据后,才可以进行数据运算,这样会增大数据处理的时延。Currently, in the image processing process, a line buffer is usually used to implement data input of a convolution layer operation or a pooling layer operation. The structure of the line buffer requires that the input data be input in the order of rasterization in the order of rows (or columns). Taking the height of the pooled window as k and the width of the input feature matrix as W, the line buffer needs to be buffered by the depth k*W, that is, the line buffer must buffer the input data of size k*W before the line buffer can be cached. Data operations, which increase the latency of data processing.
上述可知,现有的图像处理方案需要的缓存空间较大,同时数据处理的时延也较大。As can be seen from the above, the existing image processing scheme requires a large buffer space and a large delay in data processing.
发明内容Summary of the invention
本申请提供一种用于神经网络的运算装置、电路及相关方法,可以有效节省缓存空间,同时可以减小数据处理的时延。The present application provides an arithmetic device, a circuit, and a related method for a neural network, which can effectively save buffer space and reduce the delay of data processing.
第一方面,提供一种用于神经网络的运算装置,所述运算装置包括:第一处理单元,用于根据计算窗口的大小对k1个输入特征数据进行第一运算 操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;第二处理单元,用于根据所述计算窗口的大小对所述第一处理单元输出的k2个中间结果进行第二运算操作,获得计算结果。In a first aspect, an arithmetic device for a neural network is provided, the computing device comprising: a first processing unit, configured to perform a first operation on k1 input feature data according to a size of a calculation window Operation, obtaining an intermediate result, the size of the calculation window is k1 × k2, k1 and k2 are both positive integers; the second processing unit is configured to output k2 of the first processing unit according to the size of the calculation window The intermediate result performs a second arithmetic operation to obtain a calculation result.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, by decomposing the window operation of the neural network into a column operation and a row operation, the calculation can be started as long as one row or one column of input data is received, in other words, the row or column buffer can be used. Input the feature matrix, and can perform operations at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can be performed, so that the delay of data processing can be effectively reduced, and the neural network can be effectively improved. Data processing efficiency, while also saving storage resources, thereby saving hardware resources.
结合第一方面,在第一方面的一种可能的实现方式中,所述运算装置包括M个所述第一处理单元与M个所述第二处理单元,且所述M个第一处理单元与所述M个第二处理单元一一对应,M为大于1的正整数;所述运算装置还包括:预处理单元,用于按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,所述预处理单元还用于将所述M组数据一对一地输入到所述M个第一处理单元中。With reference to the first aspect, in a possible implementation manner of the first aspect, the computing device includes M first processing units and M second processing units, and the M first processing units One-to-one corresponding to the M second processing units, M is a positive integer greater than 1; the computing device further includes: a pre-processing unit, configured to receive the input feature matrix by column, and receive the received image according to the calculation window The column input feature values are processed to obtain M sets of data, wherein each set of data includes k1 input feature values, and the pre-processing unit is further configured to input the M sets of data one-to-one to the M firsts Processing unit.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第二方面,提供一种用于处理神经网络的电路,所述电路包括:第一处理电路,用于根据计算窗口的大小对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;第二处理电路,用于根据所述计算窗口的大小对所述第一处理电路输出的k2个中间结果进行第二运算操作,获得计算结果。In a second aspect, a circuit for processing a neural network is provided, the circuit comprising: a first processing circuit, configured to perform a first operation operation on the k1 input feature data according to a size of the calculation window, to obtain an intermediate result, The size of the calculation window is k1×k2, and k1 and k2 are both positive integers; the second processing circuit is configured to perform a second operation operation on the k2 intermediate results output by the first processing circuit according to the size of the calculation window, Get the calculation results.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, by decomposing the window operation of the neural network into a column operation and a row operation, the calculation can be started as long as one row or one column of input data is received, in other words, the row or column buffer can be used. Input the feature matrix, and can perform operations at the same time, without having to cache a certain amount of two-dimensional input data before the calculation can be performed, so that the delay of data processing can be effectively reduced, and the neural network can be effectively improved. Data processing efficiency, while also saving storage resources, thereby saving hardware resources.
结合第二方面,在第二方面的一种可能的实现方式中,所述电路包括M个所述第一处理电路与M个所述第二处理电路,且所述M个第一处理电路 与所述M个第二处理电路一一对应,M为大于1的正整数;所述电路还包括:预处理电路,用于按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,所述预处理电路还用于将所述M组数据一对一地输入到所述M个第一处理电路中。With reference to the second aspect, in a possible implementation manner of the second aspect, the circuit includes M first processing circuits and M second processing circuits, and the M first processing circuits One-to-one correspondence with the M second processing circuits, M is a positive integer greater than 1; the circuit further includes: a pre-processing circuit for receiving an input feature matrix by column, and receiving the column according to the calculation window The input feature values are processed to obtain M sets of data, wherein each set of data includes k1 input feature values, and the pre-processing circuit is further configured to input the M sets of data one-to-one to the M first processes In the circuit.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第三方面,提供一种用于处理神经网络的方法,所述方法包括:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果。In a third aspect, a method for processing a neural network is provided. The method comprises: performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
结合第三方面,在第三方面的一种可能的实现方式中,所述方法还包括:按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值;所述根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,包括:根据计算窗口的大小,分别对所述M组数据进行所述第一运算操作,获得对应的中间结果;所述根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果,包括:分别针对所述M组数据中的每一组数据对应的第一运算操作,每获得k2个中间结果,进行所述第二运算操作,获得对应的计算结果。With reference to the third aspect, in a possible implementation of the third aspect, the method further includes: receiving an input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values. Performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. Performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results are obtained, to obtain a corresponding calculation result.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
第四方面,提供一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,所述计算窗口的大小为k1×k2,k1与k2均为正整数;根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果。In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a computer, the computer program implements: performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.
结合第四方面,在第四方面的一种可能的实现方式中,所述计算机程序被计算机执行时还用于实现:按列接收输入特征矩阵,并根据所述计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值;所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,对k1个输入特征数据进行第一运算操作,获得中间结果,包括:所述计算机程序被计算机执行时用于实现:根据计算窗口的大小,分别对所述M组数据进行所述第一运算操作,获得对应的中间结果;所述计算机程序被计算机执行时用于实现:根据所述计算窗口的大小,对所述第一运算操作获得的k2个中间结果进行第二运算操作,获得计算结果,包括:所述计算机程序被计算机执行时用于实现:分别针对所述M组数据中的每一组数据对应的第一运算操作,每获得k2个中间结果,进行所述第二运算操作,获得对应的计算结果。With reference to the fourth aspect, in a possible implementation of the fourth aspect, when executed by a computer, the computer program further implements: receiving an input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values. The step of performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain corresponding intermediate results. The step of performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results are obtained, to obtain a corresponding calculation result.
在本申请提供的技术方案中,可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, parallel processing of image data can be realized, thereby effectively improving the efficiency of data processing.
综上所述,在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In summary, in the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1为神经网络卷积层运算的示意图。Figure 1 is a schematic diagram of a neural network convolutional layer operation.
图2为神经网络池化层运算的示意图。2 is a schematic diagram of a neural network pooling layer operation.
图3为本申请实施例提供的用于神经网络的运算装置的示意性框图。FIG. 3 is a schematic block diagram of an operation apparatus for a neural network according to an embodiment of the present application.
图4为本申请另一实施例提供的用于神经网络的运算装置的示意性框图。FIG. 4 is a schematic block diagram of an operation apparatus for a neural network according to another embodiment of the present application.
图5为本申请再一实施例提供的用于神经网络的运算装置的示意性框图。FIG. 5 is a schematic block diagram of an operation device for a neural network according to still another embodiment of the present application.
图6为本申请实施例提供的用于处理神经网络的电路的示意图。FIG. 6 is a schematic diagram of a circuit for processing a neural network according to an embodiment of the present application.
图7为本申请另一实施例提供的用于神经网络的电路的示意性框图。FIG. 7 is a schematic block diagram of a circuit for a neural network according to another embodiment of the present application.
图8为本申请实施例提供的用于处理神经网络的方法的示意性流程图。FIG. 8 is a schematic flowchart of a method for processing a neural network according to an embodiment of the present application.
具体实施方式DETAILED DESCRIPTION
为了便于理解本申请提供的技术方案,下面首先介绍卷积神经网络中的卷积层运算与池化层运算。In order to facilitate the understanding of the technical solution provided by the present application, the following describes the convolution layer operation and the pooling layer operation in the convolutional neural network.
1)卷积层运算1) Convolution layer operation
卷积层运算的运算过程为,将一个固定大小的窗口滑动过整个图像平面,在每个时刻对窗口内覆盖的数据进行乘累加运算。卷积层运算中,窗口滑动的步长为1。The operation of the convolution layer operation is to slide a fixed-size window across the entire image plane, and multiply and accumulate the data covered in the window at each moment. In a convolutional layer operation, the window slides in steps of 1.
图1为卷积层运算的示意图。输入图像的高H1为3,宽W1为4;卷积窗口的高k1为2,宽k2为2。卷积层运算为一个2×2的卷积窗口在3×4的图像上以步长为1的间隔滑动,每个卷积窗口覆盖的4个数据进行乘累加运算,得到一个输出结果,所有输出结果构成输出图像。如图1所示,输出图像的高H2为2,宽W2为3。FIG. 1 is a schematic diagram of a convolution layer operation. The input image has a height H1 of 3 and a width W1 of 4; the convolution window has a height k1 of 2 and a width k2 of 2. In the convolution layer operation, a 2×2 convolution window slides over the 3×4 image with a stride of 1, and the 4 data values covered by the window at each position are multiplied and accumulated to obtain one output result; all output results constitute the output image. As shown in FIG. 1, the output image has a height H2 of 2 and a width W2 of 3.
图1中所示的输出结果o1是通过如下公式得到的:The output result o1 shown in Fig. 1 is obtained by the following formula:
o1=op{d1,d2,d3,d4},o1=op{d1,d2,d3,d4},
其中,运算符op的运算方式为乘累加。Among them, the operation mode of the operator op is multiply and accumulate.
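As a concrete illustration of the multiply-accumulate window operation above, the following Python sketch (not part of the patent; the kernel `w` and input values are assumed for illustration) reproduces the Fig. 1 configuration: a 2×2 window sliding with stride 1 over a 3×4 input.

```python
# Illustrative sketch of the direct convolution-layer window operation of
# Fig. 1: a k1 x k2 window slides with stride 1 over an H1 x W1 input,
# multiply-accumulating the data it covers. Weights `w` are hypothetical.
def conv_layer(image, w):
    k1, k2 = len(w), len(w[0])
    h1, w1 = len(image), len(image[0])
    out = []
    for i in range(h1 - k1 + 1):
        row = []
        for j in range(w1 - k2 + 1):
            # multiply-accumulate over the 4 values covered by the window
            acc = 0
            for a in range(k1):
                for b in range(k2):
                    acc += image[i + a][j + b] * w[a][b]
            row.append(acc)
        out.append(row)
    return out

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]   # H1 = 3, W1 = 4 as in Fig. 1
w = [[1, 0],
     [0, 1]]              # hypothetical 2x2 kernel
res = conv_layer(img, w)  # H2 = 2 rows by W2 = 3 columns, as in Fig. 1
```

With this hypothetical kernel, each output is the sum of the top-left and bottom-right values under the window, and the output size is 2×3, matching H2 and W2 of Fig. 1.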
2)池化层运算2) Pool layer operation
池化层运算的运算过程为,将一个固定大小的窗口滑动过整个图像平面,在每个时刻对窗口内覆盖的数据进行运算,求最大值或者求平均值作为输出。池化层运算中,窗口滑动的步长等于窗口的高(或宽)。The pooling layer operation slides a fixed-size window across the entire image plane and, at each position, operates on the data covered by the window, taking the maximum or the average as the output. In a pooling layer operation, the stride of the window slide is equal to the height (or width) of the window.
图2为池化层运算的示意图。输入图像的高H1为6,宽W1为8;池化窗口的高k1为2,宽k2为2。池化层运算为一个2×2的池化窗口在6×8的图像上以步长为2的间隔滑动,每个窗口覆盖的4个数据得到一个输出结果,所有输出结果构成输出图像。如图2所示,输出图像的高H2为3,宽W2为4。FIG. 2 is a schematic diagram of a pooling layer operation. The input image has a height H1 of 6 and a width W1 of 8; the pooling window has a height k1 of 2 and a width k2 of 2. In the pooling layer operation, a 2×2 pooling window slides over the 6×8 image with a stride of 2, and the 4 data values covered by each window yield one output result; all output results constitute the output image. As shown in FIG. 2, the output image has a height H2 of 3 and a width W2 of 4.
图2中所示的输出结果o1是通过如下公式得到的:The output result o1 shown in Fig. 2 is obtained by the following formula:
o1=op{d1,d2,d3,d4},o1=op{d1,d2,d3,d4},
其中,根据配置不同,运算符op的运算方式为求最大值(max)或求平均值(avg)。Among them, depending on the configuration, the operation mode of the operator op is to find the maximum value (max) or the average value (avg).
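The pooling operation above admits a similar sketch (illustrative only, not from the patent): a 2×2 window slides with stride 2 over a 6×8 input, and each window yields its maximum (or average) depending on configuration.

```python
# Illustrative sketch of the direct pooling-layer window operation of
# Fig. 2: a k1 x k2 window slides with stride equal to its own size,
# and op is max or average depending on configuration.
def pool_layer(image, k1, k2, op="max"):
    h1, w1 = len(image), len(image[0])
    out = []
    for i in range(0, h1 - k1 + 1, k1):        # vertical stride k1
        row = []
        for j in range(0, w1 - k2 + 1, k2):    # horizontal stride k2
            window = [image[i + a][j + b]
                      for a in range(k1) for b in range(k2)]
            row.append(max(window) if op == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out

# hypothetical 6x8 input (H1 = 6, W1 = 8 as in Fig. 2)
img = [[r * 8 + c for c in range(8)] for r in range(6)]
res = pool_layer(img, 2, 2, op="max")
# output is H2 = 3 rows by W2 = 4 columns, matching Fig. 2
```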
在现有的神经网络计算过程(卷积运算或池化运算)中,通常是“先取窗口,再进行计算”。以图2所示的池化运算为例,先获取池化窗口覆盖的4个输入数据,然后,再对4个输入数据进行计算。In the existing neural network calculation process (convolution operation or pooling operation), it is usually "first take the window and then perform the calculation". Taking the pooling operation shown in FIG. 2 as an example, the four input data covered by the pooled window are first acquired, and then the four input data are calculated.
在本申请中,将“先取窗口,再进行计算”的过程分解为列操作与行操作。In the present application, the process of "first taking a window and then performing calculation" is decomposed into a column operation and a row operation.
可选地,作为一种实现方式,将“先取窗口,再进行计算”的过程分解为先列操作,再行操作。Optionally, as one implementation, the process of "first taking the window, then computing" is decomposed into column operations first, followed by row operations.
具体地,先对窗口内的同一列的数据进行计算,得到中间结果;然后对窗口内所有列的中间结果进行计算,得到计算结果。Specifically, the data of the same column in the window is first calculated to obtain an intermediate result; then the intermediate result of all the columns in the window is calculated to obtain a calculation result.
以图1所示窗口2×2为例,d1,d2,d3,d4参与运算,得到结果o1=op{d1,d2,d3,d4}。在本申请中,将如图1所示的窗口2×2的操作o1=op{d1,d2,d3,d4}分解为:先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1与op2的运算方式均为乘累加,op3的运算方式为累加。Taking the 2×2 window shown in FIG. 1 as an example, d1, d2, d3 and d4 participate in the operation, giving the result o1=op{d1,d2,d3,d4}. In the present application, the 2×2 window operation o1=op{d1,d2,d3,d4} of FIG. 1 is decomposed as follows: a column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1 and op2 are both multiply-accumulate operations, and op3 is an accumulation operation.
以图2所示窗口2×2为例,d1,d2,d3,d4参与运算,得到结果o1=op{d1,d2,d3,d4}。在本申请中,将如图2所示的窗口2×2的操作o1=op{d1,d2,d3,d4}分解为:先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1、op2、op3的运算方式均为求最大值或求平均值。Taking the 2×2 window shown in FIG. 2 as an example, d1, d2, d3 and d4 participate in the operation, giving the result o1=op{d1,d2,d3,d4}. In the present application, the 2×2 window operation o1=op{d1,d2,d3,d4} of FIG. 2 is decomposed as follows: a column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1, op2 and op3 are all maximum or averaging operations.
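The column-then-row decomposition just described can be checked on a single window. The sketch below (values hypothetical, not from the patent) uses the pooling case, where op1, op2 and op3 are all `max`; for the convolution case, op1 and op2 would be multiply-accumulates against the kernel columns and op3 an accumulation.

```python
# Illustrative sketch: decomposing one 2x2 max-pooling window into
# column operations followed by a row operation.
d1, d3 = 5, 9      # first column of the window (hypothetical values)
d2, d4 = 7, 3      # second column of the window

# column operations: one intermediate result per column
p1 = max(d1, d3)   # op1
p2 = max(d2, d4)   # op2

# row operation over the per-column intermediate results
o1 = max(p1, p2)   # op3

# identical to the direct window operation o1 = op{d1, d2, d3, d4}
assert o1 == max(d1, d2, d3, d4)
```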
可选地,作为一种实现方式,将“先取窗口,再进行计算”的过程分解为先行操作,再列操作。Optionally, as another implementation, the process of "first taking the window, then computing" is decomposed into row operations first, followed by column operations.
具体地,先对窗口内的同一行的数据进行计算,得到中间结果;然后对窗口内所有行的中间结果进行计算,得到计算结果。Specifically, the data of the same row in the window is first calculated to obtain an intermediate result; then the intermediate result of all the rows in the window is calculated to obtain a calculation result.
上述可知,在本申请中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延。同时,可以随输入数据的输入方式灵活设置数据缓存方式,例如,输入数据是按列输入的,则按列缓存,且对缓存的数据先进行列操作再进行行操作;再例如,输入数据是按行输入的,则按行缓存,且对缓存的数据先进行行操作再进行列操作。As can be seen from the above, in the present application, by decomposing the window operation of the neural network into column operations and row operations, computation can start as soon as one row or one column of input data is received, without first caching a certain amount of two-dimensional input data as required in the prior art; therefore, the latency of data processing can be effectively reduced. Meanwhile, the data caching scheme can be set flexibly according to how the input data arrives: for example, if the input data arrives column by column, it is cached by column, and column operations are performed on the cached data before row operations; if the input data arrives row by row, it is cached by row, and row operations are performed on the cached data before column operations.
下文对本申请提供的用于神经网络的运算装置、电路及相关方法进行详细描述。The arithmetic device, circuit and related method for neural network provided by the present application are described in detail below.
图3为本申请提供的用于神经网络的运算装置300的示意性框图。该运算装置300包括:FIG. 3 is a schematic block diagram of an arithmetic device 300 for a neural network provided by the present application. The computing device 300 includes:
第一处理单元310,用于根据计算窗口的大小对k1个输入特征数据进行第一运算操作,获得中间结果,该计算窗口的大小为k1×k2,k1与k2均为正整数。The first processing unit 310 is configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and both k1 and k2 are positive integers.
第二处理单元320,用于根据该计算窗口的大小对该第一处理单元输出的k2个中间结果进行第二运算操作,获得计算结果。The second processing unit 320 is configured to perform a second operation on the k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
可选地,第一处理单元310用于对输入特征矩阵中的列输入特征值中的k1个输入特征数据进行第一运算操作,其中,k1表示该计算窗口的高度,k2表示该计算窗口的宽度;第二处理单元320用于对第一处理单元输出的k2个中间结果进行第二运算操作,相当于是对不同列的中间结果做第二运算操作,获得计算结果。Optionally, the first processing unit 310 is configured to perform the first operation on k1 input feature data within a column of input feature values of the input feature matrix, where k1 represents the height of the calculation window and k2 represents its width; the second processing unit 320 is configured to perform the second operation on the k2 intermediate results output by the first processing unit, which amounts to performing the second operation on the intermediate results of different columns, to obtain the calculation result.
在本实施例中,第一处理单元310可称为列处理单元,对应地,第一运算操作称为列操作;第二处理单元320可称为行处理单元,对应地,第二运算操作称为行操作。In this embodiment, the first processing unit 310 may be referred to as a column processing unit and, correspondingly, the first operation as a column operation; the second processing unit 320 may be referred to as a row processing unit and, correspondingly, the second operation as a row operation.
可选地,第一处理单元310用于对输入特征矩阵中的行输入特征值中的k1个输入特征数据进行第一运算操作,其中,k1表示该计算窗口的宽度,k2表示该计算窗口的高度;第二处理单元320用于对第一处理单元输出的k2个中间结果进行第二运算操作,相当于是对不同行的中间结果做第二运算操作,获得计算结果。Optionally, the first processing unit 310 is configured to perform the first operation on k1 input feature data within a row of input feature values of the input feature matrix, where k1 represents the width of the calculation window and k2 represents its height; the second processing unit 320 is configured to perform the second operation on the k2 intermediate results output by the first processing unit, which amounts to performing the second operation on the intermediate results of different rows, to obtain the calculation result.
在本实施例中,第一处理单元310可称为行处理单元,对应地,第一运算操作称为行操作;第二处理单元320可称为列处理单元,对应地,第二运算操作称为列操作。In this embodiment, the first processing unit 310 may be referred to as a row processing unit and, correspondingly, the first operation as a row operation; the second processing unit 320 may be referred to as a column processing unit and, correspondingly, the second operation as a column operation.
在本申请提供的技术方案中,通过将神经网络的窗口操作分解为列操作与行操作,使得只要接收到一行或一列输入数据,就可以开始计算,换句话说,可以按行或按列缓存输入特征矩阵,并且可以同时进行运算,无需像现有技术中必须先缓存够一定数量的二维输入数据之后,才可以进行计算,因此,可以有效减小数据处理的时延,有效提高神经网络的数据处理效率,同时也可以节省存储资源,从而节省硬件资源。In the technical solution provided by the present application, the window operation of the neural network is decomposed into a column operation and a row operation, so that computation can start as soon as one row or one column of input data is received. In other words, the input feature matrix can be buffered row by row or column by column while operations proceed concurrently, without first caching a certain amount of two-dimensional input data as required in the prior art. Therefore, the latency of data processing can be effectively reduced, the data processing efficiency of the neural network can be effectively improved, and storage resources, and thus hardware resources, can be saved.
下文主要以先列处理后行处理为例进行描述,但本申请实施例并非限定于此。根据实际需要,也可以先行处理再列处理。The following description mainly takes column processing followed by row processing as an example, but the embodiments of the present application are not limited thereto. According to actual needs, row processing may also be performed first, followed by column processing.
可选地,作为一个实施例,该计算窗口为卷积窗口,该第一运算操作的运算方式为乘累加运算,该第二运算操作的运算方式为累加运算。Optionally, as an embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
以图1所示的输入图像与卷积窗口为例,且以先列操作后行操作为例。先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1与op2的运算方式均为乘累加,op3的运算方式为累加。Take the input image and convolution window shown in FIG. 1 as an example, with column operations performed first and row operations second. A column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1 and op2 are both multiply-accumulate operations, and op3 is an accumulation operation.
本实施例可以提高神经网络的卷积运算效率。This embodiment can improve the convolution operation efficiency of the neural network.
可选地,作为另一个实施例,该计算窗口为池化窗口,该第一运算操作的运算方式为求最大值或求平均值,该第二运算操作的运算方式与该第一运算操作的运算方式相同。Optionally, as another embodiment, the calculation window is a pooling window, the first operation is a maximum or averaging operation, and the second operation uses the same operation mode as the first operation.
以图2所示的输入图像与池化窗口为例,且以先列操作后行操作为例。先对窗口内同一列的d1与d3进行列操作,得到中间结果p1=op1{d1,d3},以及窗口内同一列的d2与d4进行列操作,得到中间结果p2=op2{d2,d4};然后对所有列的中间结果p1与p2进行行操作,得到最终运算结果o1=op3{p1,p2}。其中,运算符op1、op2、op3的运算方式均为求最大值或求平均值。Take the input image and pooling window shown in FIG. 2 as an example, with column operations performed first and row operations second. A column operation is first performed on d1 and d3 in one column of the window to obtain the intermediate result p1=op1{d1,d3}, and on d2 and d4 in the other column to obtain the intermediate result p2=op2{d2,d4}; then a row operation is performed on the intermediate results p1 and p2 of all columns to obtain the final result o1=op3{p1,p2}. Here, op1, op2 and op3 are all maximum or averaging operations.
本实施例可以提高神经网络的池化运算效率。This embodiment can improve the pooling operation efficiency of the neural network.
可选地,如图4所示,该运算装置包括M个该第一处理单元310与M个该第二处理单元320,且该M个第一处理单元310与该M个第二处理单元320一一对应,M为大于1的正整数;Optionally, as shown in FIG. 4, the computing device includes M first processing units 310 and M second processing units 320, the M first processing units 310 corresponding one-to-one to the M second processing units 320, where M is a positive integer greater than 1.
该运算装置300还包括:The computing device 300 further includes:
预处理单元330,用于按列接收输入特征矩阵,并根据该计算窗口对接收的列输入特征值进行处理,得到M组数据,其中,每组数据包括k1个输入特征值,该预处理单元还用于将该M组数据一对一地输入到该M个第一处理单元中。The pre-processing unit 330 is configured to receive the input feature matrix column by column and process each received column of input feature values according to the calculation window to obtain M groups of data, where each group of data includes k1 input feature values; the pre-processing unit is further configured to input the M groups of data one-to-one into the M first processing units.
具体地,预处理单元330接收到输入特征矩阵中的第一列输入特征值,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中。预处理单元330接收到输入特征矩阵中的第二列输入特征值,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中。以此类推,当预处理单元330接收到第k2列输入特征值时,将其处理为M组数据,分别输入M个第一处理单元310中进行列处理;M个第一处理单元310输出M个中间结果,并将该M个中间结果一对一地输入到M个第二处理单元320中,这时,M个第二处理单元320中的每个第二处理单元320已经接收到k2个中间结果了,第二处理单元320对接收到的k2个中间结果进行行操作,得到运算结果,即M个第二处理单元320得到M个运算结果。后续,预处理单元330可以继续接收列输入特征值,重复执行上文描述的流程,得到下一次的M个运算结果,这里不再赘述。Specifically, the pre-processing unit 330 receives the first column of input feature values of the input feature matrix, processes it into M groups of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 output M intermediate results and input them one-to-one into the M second processing units 320. The pre-processing unit 330 then receives the second column of input feature values, processes it into M groups of data, and inputs them into the M first processing units 310 for column processing; the M first processing units 310 again output M intermediate results and input them one-to-one into the M second processing units 320. By analogy, when the pre-processing unit 330 receives the k2-th column of input feature values and processes it in the same way, each of the M second processing units 320 has received k2 intermediate results; each second processing unit 320 then performs the row operation on its k2 intermediate results to obtain an operation result, so that the M second processing units 320 obtain M operation results. Subsequently, the pre-processing unit 330 may continue to receive columns of input feature values and repeat the above process to obtain the next M operation results, which is not described here again.
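A minimal simulation of this column-by-column flow, assuming the pooling case with stride equal to the window size (the function name and data layout below are illustrative, not from the patent), shows M column-unit/row-unit pairs each consuming k1 values per input column and emitting one result per k2 intermediate results:

```python
# Illustrative sketch: column-by-column streaming through M parallel
# (column-unit, row-unit) pairs for k1 x k2 max pooling, where each
# column unit handles k1 rows and each row unit buffers k2 intermediates.
def stream_columns(columns, k1, k2):
    H = len(columns[0])
    M = H // k1                       # one unit pair per window row
    buffers = [[] for _ in range(M)]  # per-row-unit intermediate buffers
    results = [[] for _ in range(M)]
    for col in columns:               # one input column per step
        for m in range(M):
            group = col[m * k1:(m + 1) * k1]  # k1 values for unit m
            p = max(group)                    # column operation
            buffers[m].append(p)
            if len(buffers[m]) == k2:         # row operation once k2
                results[m].append(max(buffers[m]))  # intermediates arrive
                buffers[m].clear()
    return results

# the 6x8 input of Fig. 2 fed as 8 columns of height 6 (values assumed)
img = [[r * 8 + c for c in range(8)] for r in range(6)]
cols = [[img[r][c] for r in range(6)] for c in range(8)]
out_rows = stream_columns(cols, 2, 2)
# out_rows[m] is row m of the 3x4 output image of Fig. 2
```

Note that computation begins as soon as the first column arrives; no two-dimensional block of input has to be cached beforehand, which is the point of the decomposition.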
上文已述,当前技术中,通常将一张输出特征图分割成多个特征图片段,依次输出每个特征图片段,并且每个特征图片段按列并行输出。例如,一个完整的输出特征图被分割成3个特征图片段,每个特征图片段被按列依次输出。现有技术中,特征图片段的数据是按列输入,线缓冲器是按行输入,相当于特征图片段的数据是并行输入的,但是线缓冲器的方式是串行处理数据,会导致输入输出速率不匹配,吞吐数据的速率太低,会成为加速器的瓶颈,降低加速器的速率。As described above, in the current technology, an output feature map is usually divided into multiple feature map segments, the segments are output in sequence, and each segment is output column by column in parallel. For example, a complete output feature map may be divided into 3 feature map segments, each output column by column in turn. In the prior art, the data of a feature map segment is input column by column while a line buffer takes input row by row; the segment data arrives in parallel but the line buffer processes data serially, so the input and output rates do not match, the data throughput is too low, and the line buffer becomes the bottleneck of the accelerator, lowering its speed.
在本申请中,预处理单元330按列接收一个特征图片段,M个第一处理单元对该特征图片段的列输入特征值进行列操作,M个第二处理单元根据M个第一处理单元输出的中间结果进行行操作,从而得到该特征图片段的计算结果,即该特征图片段的神经网络处理结果。In the present application, the pre-processing unit 330 receives a feature map segment column by column, the M first processing units perform column operations on the columns of input feature values of the segment, and the M second processing units perform row operations on the intermediate results output by the M first processing units, thereby obtaining the calculation result of the segment, i.e., the neural network processing result of the feature map segment.
在本申请提供的技术方案中,可以随输入数据的输入方式灵活设置数据缓存方式,例如,输入数据是按列输入的,则按列缓存,且对缓存的数据先进行列操作再进行行操作;再例如,输入数据是按行输入的,则按行缓存,且对缓存的数据先进行行操作再进行列操作,从而可以提高数据吞吐率。同时,本实施例提供的运算装置可以实现图像数据的并行处理,从而有效提高数据处理的效率。In the technical solution provided by the present application, the data caching scheme can be set flexibly according to how the input data arrives: for example, if the input data arrives column by column, it is cached by column, and column operations are performed on the cached data before row operations; if the input data arrives row by row, it is cached by row, and row operations are performed on the cached data before column operations. This improves the data throughput. Meanwhile, the computing device provided in this embodiment can process image data in parallel, thereby effectively improving the efficiency of data processing.
可选地,在本实施例中,运算装置300所包括的第一处理单元310以及第二处理单元320的数量M是根据输入特征矩阵的大小与计算窗口的大小确定的。Optionally, in the embodiment, the number M of the first processing unit 310 and the second processing unit 320 included in the computing device 300 is determined according to the size of the input feature matrix and the size of the calculation window.
以计算窗口为卷积窗口为例,以第一处理单元310进行列处理,第二处理单元320进行行处理为例。假设输入特征矩阵的行数为H,H为大于或等于k1的整数,卷积窗口的高为k1,宽为k2,则M=H-(k1-1)。Take the case where the calculation window is a convolution window, the first processing unit 310 performs column processing, and the second processing unit 320 performs row processing. Suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the convolution window has height k1 and width k2; then M=H-(k1-1).
在本实施例中,该M组数据包括该列输入特征值中的所有数据,即本申请提供的运算装置300可以实现一列输入特征值的并行处理。In this embodiment, the M group data includes all the data in the column input feature values, that is, the computing device 300 provided by the present application can implement parallel processing of a column of input feature values.
以计算窗口为池化窗口为例,以第一处理单元310进行列处理,第二处理单元320进行行处理为例。假设输入特征矩阵的行数为H,H为大于或等于k1的整数,池化窗口的高为k1,宽为k2,则M=⌊H/k1⌋。Take the case where the calculation window is a pooling window, the first processing unit 310 performs column processing, and the second processing unit 320 performs row processing. Suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the pooling window has height k1 and width k2; then M=⌊H/k1⌋.
当H可以被k1整除时,该M组数据包括该列输入特征值中的所有数据,即本申请提供的运算装置300可以实现一列输入特征值的并行处理。When H is divisible by k1, the M groups of data include all data of the column of input feature values, i.e., the computing device 300 provided by the present application can process a whole column of input feature values in parallel.
当H不被k1整除时,该M组数据为该列输入特征值中的部分数据;该预处理单元330还包括缓冲器;该预处理单元330还用于,将该列输入特征值中除该M组数据之外的剩余数据存入该缓冲器。When H is not divisible by k1, the M groups of data contain only part of the column of input feature values. The pre-processing unit 330 further includes a buffer, and the pre-processing unit 330 is further configured to store the remaining data of the column, other than the M groups of data, into the buffer.
这种场景下,需要将输入特征矩阵的最后几行的数据先缓存在缓冲器中,后续单独进行处理。In this scenario, the data of the last few rows of the input feature matrix needs to be cached in the buffer first, and then processed separately.
例如,在将一张输出特征图分割成多个特征图片段,每个特征图片段按列并行输出的场景中。假设将一张输出特征图分割成2个特征图片段,第一个特征图片段的高度不被池化窗口的高k1整除,则第一个特征图片段的最后几行数据先缓存在缓冲器中,等到第二个特征图片段输入有效时,从缓冲器中读出缓存的数据,与当前数据(第二个特征图片段的数据)拼接成一个新的特征图片段,重新映射到M个第一处理单元310中进行处理。For example, consider the scenario in which an output feature map is divided into multiple feature map segments and each segment is output column by column in parallel. Suppose an output feature map is divided into 2 feature map segments and the height of the first segment is not divisible by the pooling-window height k1; the last few rows of data of the first segment are first cached in the buffer. When the input of the second segment becomes valid, the cached data is read out of the buffer, spliced with the current data (the data of the second segment) into a new feature map segment, and remapped onto the M first processing units 310 for processing.
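A sketch of this segment-splicing behavior (function name and data hypothetical, not from the patent): rows that do not fill a whole pooling-window height are held back in a buffer and prepended to the next segment.

```python
# Illustrative sketch: when the segment height H is not divisible by the
# pooling-window height k1, the bottom H % k1 rows are buffered and
# spliced onto the front of the next feature-map segment.
def split_segment(segment, k1):
    H = len(segment)
    M = H // k1                  # rows that map onto the M unit pairs
    mapped = segment[:M * k1]    # processed immediately
    leftover = segment[M * k1:]  # buffered until the next segment arrives
    return mapped, leftover

seg1 = [[r] * 4 for r in range(5)]     # hypothetical 5-row segment, k1 = 2
mapped, buf = split_segment(seg1, 2)   # 4 rows mapped, 1 row buffered
seg2 = [[r] * 4 for r in range(5, 8)]  # hypothetical next 3-row segment
new_segment = buf + seg2               # spliced into a new 4-row segment
```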
FIG. 5 is a schematic block diagram of a computing apparatus 500 for a neural network provided in this application. The computing apparatus 500 includes a preprocessing unit 510, M column processing units 520, and M row processing units 530, where the M column processing units 520 correspond one-to-one with the M row processing units 530.

The preprocessing unit 510 is configured to receive input data, preprocess the input data according to the calculation window to obtain M groups of data, each group containing k1 input feature values, and input the M groups of data one-to-one into the M column processing units, where the calculation window has height k1 and width k2.

Specifically, the preprocessing unit 510 being configured to receive input data includes: the preprocessing unit 510 receives the input feature matrix column by column.

The column processing unit 520 is configured to perform a column operation on the k1 input feature values it receives to obtain an intermediate result, and to input the intermediate result into the corresponding row processing unit 530.
Specifically, for a pooling-layer operation, the column operation is taking the maximum or the average; for a convolutional-layer operation, the column operation is a multiply-accumulate operation.

The row processing unit 530 is configured to buffer the intermediate results output by the corresponding column processing unit 520; each time k2 intermediate results have been received, it performs a row operation on those k2 intermediate results to obtain a calculation result.

Specifically, for a pooling-layer operation, the row operation uses the same arithmetic as the column operation; for a convolutional-layer operation, the row operation is an accumulation.
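For max pooling, the column/row decomposition just described can be checked with a short sketch (pure Python; the window contents are assumed for illustration): taking the maximum of each k1-tall column gives k2 intermediate results, and taking the maximum of those reproduces the maximum of the full k1×k2 window.

```python
def window_max_decomposed(window):
    """Max pooling of one window via a column operation then a row operation."""
    # Column operation: one intermediate result per column (max of k1 values).
    intermediates = [max(col) for col in zip(*window)]
    # Row operation: same arithmetic (max), applied to the k2 intermediates.
    return max(intermediates)

window = [[3, 1, 4],
          [1, 5, 9],
          [2, 6, 5]]                     # one 3x3 pooling window
result = window_max_decomposed(window)   # 9, the max of all nine values
```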
As shown in FIG. 5, the calculation results of the M row processing units 530 constitute the output data of the computing apparatus 500.
Optionally, in this embodiment, the input data received by the preprocessing unit 510 is a feature map segment obtained from the input feature map to be processed.

Optionally, in some embodiments, the number M of column processing units 520 and row processing units 530 is determined by the size of the input feature matrix received by the preprocessing unit 510 and the size of the calculation window.

Specifically, the input feature matrix is a feature map segment.

Suppose a complete input feature map is divided into several feature map segments. The preprocessing unit 510 is configured to receive these feature map segments in sequence.
In some cases, the sliding window (i.e., the calculation window) may simultaneously cover part of the data of two adjacent feature map segments. In this case, the preprocessing unit 510 is configured to cache the last few rows of the previous feature map segment covered by the window in the buffer of the preprocessing unit 510 (as shown in FIG. 5); when the input of the next feature map segment becomes valid, the cached data is read out of the buffer and spliced with the current data (i.e., the currently input feature map segment) into a new feature map segment, and the new data is remapped to the M column processing units 520.

This embodiment effectively saves cache space, and therefore hardware resources.

As an example, take the input feature map of height 6 and width 8 shown in FIG. 2 and a 2×2 pooling window with stride 2. Each column processing unit 520 in FIG. 5 can process 2 rows of data in the same column, and each row processing unit 530 can process 2 columns of data in the same row, so the computing apparatus shown in FIG. 5 only needs to be provided with 3 column processing units 520 and 3 row processing units 530.

As another example, suppose the input feature map is divided into two feature map segments, segment1 and segment2, each of height h = 14, with a 3×3 pooling window and a stride of 2. When processing segment1, the preprocessing unit 510 first caches the last two rows of segment1 in the buffer; after segment2 is received, these two rows are spliced with the 14 rows of segment2 into a new feature map segment of height 16, which is then remapped to the column processing units 520.
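The number of rows that must be carried over between segments follows from the window height and stride. The helper below is an assumed illustration (not part of the patent) that reproduces the figures in this example: with h = 14, a window height of 3, and a stride of 2, two rows are buffered, and splicing them onto the next 14-row segment gives a height of 16.

```python
def rows_to_buffer(h, k, stride):
    """Rows at the bottom of a segment that the next (incomplete) window
    of height k, sliding with the given stride, still needs."""
    if h < k:
        return h                        # no complete window fits at all
    n_windows = (h - k) // stride + 1   # complete vertical window positions
    return h - n_windows * stride       # rows left over for the next window

leftover = rows_to_buffer(h=14, k=3, stride=2)  # 2 rows buffered
spliced_height = leftover + 14                  # 16-row spliced segment
```

When the stride equals the window height (non-overlapping pooling), this reduces to the h mod k remainder used elsewhere in the text.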
By decomposing the window operations of the neural network into column operations and row operations, this application allows computation to begin as soon as a single row or column of input data is received, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, enabling real-time data processing. At the same time, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation. In addition, the computing apparatus provided by this application requires less cache space than the prior art, saving hardware overhead. The computing apparatus provided by some embodiments can process multiple calculation windows in parallel, increasing the data throughput rate and overcoming the bottleneck of neural network accelerators.
As shown in FIG. 6, an embodiment of this application further provides a circuit 600 for processing a neural network. The circuit 600 may correspond to the computing apparatus 300 or 500 provided in the above embodiments. As shown in FIG. 6, the circuit 600 includes:

a first processing circuit 610, configured to perform a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers; and

a second processing circuit 620, configured to perform a second operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
Optionally, the first processing circuit 610 is configured to perform the first operation on k1 input feature data within a column of input feature values of the input feature matrix, where k1 is the height of the calculation window and k2 is its width. The second processing circuit 620 is configured to perform the second operation on k2 intermediate results output by the first processing circuit 610, which amounts to performing the second operation on intermediate results from different columns, to obtain the calculation result.

In this embodiment, the first processing circuit 610 may be called a column processing circuit and, correspondingly, the first operation is called a column operation; the second processing circuit 620 may be called a row processing circuit and, correspondingly, the second operation is called a row operation.

Optionally, the first processing circuit 610 is configured to perform the first operation on k1 input feature data within a row of input feature values of the input feature matrix, where k1 is the width of the calculation window and k2 is its height. The second processing circuit 620 is configured to perform the second operation on k2 intermediate results output by the first processing circuit, which amounts to performing the second operation on intermediate results from different rows, to obtain the calculation result.

In this embodiment, the first processing circuit 610 may be called a row processing circuit and, correspondingly, the first operation is called a row operation; the second processing circuit 620 may be called a column processing circuit and, correspondingly, the second operation is called a column operation.
In the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received. In other words, the input feature matrix can be cached by row or by column while the operations proceed, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, the data-processing efficiency of the neural network is improved, and storage resources, and hence hardware resources, are saved.

Optionally, in some embodiments, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.

Optionally, in some embodiments, the calculation window is a pooling window, the first operation is taking the maximum or the average, and the second operation uses the same arithmetic as the first operation.
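The convolution case can be sketched the same way: the first (column) operation is a multiply-accumulate of k1 input values against the matching kernel column, and the second (row) operation accumulates the k2 column results, which together equal the full window convolution. The window and kernel values below are assumed for illustration.

```python
def window_conv_decomposed(window, kernel):
    """Convolve one k1-by-k2 window via column MACs plus row accumulation."""
    k1, k2 = len(window), len(window[0])
    # First operation: multiply-accumulate each column with the kernel column.
    intermediates = [sum(window[i][j] * kernel[i][j] for i in range(k1))
                     for j in range(k2)]
    # Second operation: accumulate the k2 intermediate results.
    return sum(intermediates)

window = [[1, 2],
          [3, 4]]
kernel = [[1, 0],
          [0, 1]]
result = window_conv_decomposed(window, kernel)  # 1*1 + 2*0 + 3*0 + 4*1 = 5
```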
Optionally, as shown in FIG. 7, in one embodiment the circuit 600 includes M first processing circuits 610 and M second processing circuits 620 in one-to-one correspondence, where M is a positive integer greater than 1. The circuit 600 further includes a preprocessing circuit 630, configured to receive the input feature matrix column by column and to process each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values; the preprocessing circuit 630 is further configured to input the M groups of data one-to-one into the M first processing circuits 610.
Specifically, the preprocessing circuit 630 receives the first column of input feature values of the input feature matrix, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620. The preprocessing circuit 630 then receives the second column of input feature values, processes it into M groups of data, and inputs them into the M first processing circuits 610 for column processing; the M first processing circuits 610 again output M intermediate results, which are input one-to-one into the M second processing circuits 620. Continuing in this way, when the preprocessing circuit 630 receives the k2-th column of input feature values, processes it into M groups of data, and inputs them into the M first processing circuits 610, the M first processing circuits 610 output M intermediate results, which are input one-to-one into the M second processing circuits 620; at this point each of the M second processing circuits 620 has received k2 intermediate results, so each performs a row operation on its k2 intermediate results to obtain an operation result, i.e., the M second processing circuits 620 obtain M operation results.

Subsequently, the preprocessing circuit 630 can continue to receive columns of input feature values and repeat the flow described above to obtain the next M operation results, which is not described again here.
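The column-by-column flow just described can be simulated behaviorally. The sketch below is an illustration under assumed sizes (a 4×4 matrix, k1 = k2 = 2, hence M = 2 lanes for pooling), not a model of the actual circuit: each arriving column is split into M groups, each group is reduced by the column operation, and a lane emits a result once it has collected k2 intermediate results.

```python
def pooled_columns(matrix, k1, k2):
    """Simulate max pooling driven column by column, one lane per window row."""
    H, W = len(matrix), len(matrix[0])
    M = H // k1                        # number of parallel lanes
    pending = [[] for _ in range(M)]   # intermediates held per lane
    results = []
    for j in range(W):                 # columns arrive one at a time
        column = [matrix[i][j] for i in range(H)]
        for m in range(M):
            group = column[m * k1:(m + 1) * k1]  # k1 values for lane m
            pending[m].append(max(group))        # column operation
            if len(pending[m]) == k2:            # k2 intermediates collected
                results.append(max(pending[m]))  # row operation
                pending[m].clear()
    return results

out = pooled_columns([[1, 2, 3, 4],
                      [5, 6, 7, 8],
                      [9, 10, 11, 12],
                      [13, 14, 15, 16]], k1=2, k2=2)  # [6, 14, 8, 16]
```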
In this application, the preprocessing circuit 630 receives a feature map segment column by column; the M first processing circuits perform column operations on the columns of input feature values of the segment, and the M second processing circuits perform row operations on the intermediate results output by the M first processing circuits, thereby obtaining the calculation result of the feature map segment, i.e., the neural network processing result of the segment.
In the technical solution provided by this application, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and the cached data undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation, which improves data throughput. Moreover, the circuit provided by this embodiment can process image data in parallel, effectively improving data-processing efficiency.

Optionally, in this embodiment, the number M of first processing circuits 610 and second processing circuits 620 included in the circuit 600 is determined according to the size of the input feature matrix and the size of the calculation window.
Taking the case where the calculation window is a convolution window, the first processing circuit 610 performs column processing, and the second processing circuit 620 performs row processing as an example: suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the convolution window has height k1 and width k2; then M = H − (k1 − 1).

In this embodiment, the M groups of data include all of the data in the column of input feature values; that is, the circuit 600 provided in this application can process an entire column of input feature values in parallel.
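The formula M = H − (k1 − 1) is simply the number of positions a k1-tall window can occupy while sliding down an H-row column with stride 1, which a few assumed values confirm:

```python
def conv_lanes(H, k1, stride=1):
    """Number of vertical window positions (parallel lanes M) for a window
    of height k1 sliding over H rows."""
    return (H - k1) // stride + 1

# With stride 1 this equals H - (k1 - 1):
m = conv_lanes(H=6, k1=3)   # 6 - (3 - 1) = 4 lanes
```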
Taking the case where the calculation window is a pooling window, the first processing circuit 610 performs column processing, and the second processing circuit 620 performs row processing as an example: suppose the input feature matrix has H rows, where H is an integer greater than or equal to k1, and the pooling window has height k1 and width k2; then M = ⌊H/k1⌋ (H divided by k1, rounded down).

When H is divisible by k1, the M groups of data include all of the data in the column of input feature values; that is, the circuit 600 provided in this application can process an entire column of input feature values in parallel.

In this embodiment, the M groups of data include all of the data in the column of input feature values.

When H is not divisible by k1, the M groups of data contain only part of the column of input feature values. In this case the preprocessing circuit 630 further includes a buffer, and the preprocessing circuit 630 is further configured to store the remaining data of the column, beyond the M groups, into the buffer.

In this embodiment, the M groups of data are part of the column of input feature values; in this scenario, the last few rows of the input feature matrix are first cached in the buffer and processed separately later.
For example, consider a scenario in which an output feature map is divided into multiple feature map segments, each segment being output in parallel by column. Suppose a feature map is divided into two segments and the height of the first segment is not divisible by the pooling-window height k1. The last few rows of the first segment are first cached in the buffer; when the input of the second segment becomes valid, the cached data is read out of the buffer and spliced with the current data (the data of the second segment) into a new feature map segment, which is remapped to the M first processing circuits 610 for processing.

In the technical solution provided by this application, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and the cached data undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation, which improves data throughput. Moreover, the circuit provided by this embodiment can process image data in parallel, effectively improving data-processing efficiency.
Optionally, in some embodiments, the input feature matrix represents one feature map segment of the image to be processed; the preprocessing circuit 630 is specifically configured to receive the feature map segments of the image to be processed in sequence.

Optionally, the circuit 600 further includes a communication interface, configured to receive the image data to be processed and to output the calculation results of the second processing circuits, i.e., the output image data.
In summary, in the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, enabling real-time data processing. At the same time, the caching scheme can be set flexibly to match the input order of the data: if the input data arrives column by column, it is cached by column and undergoes the column operation before the row operation; if it arrives row by row, it is cached by row and undergoes the row operation before the column operation. In addition, the computing apparatus provided by this application requires less cache space than the prior art, saving hardware overhead. The computing apparatus provided by some embodiments can process multiple calculation windows in parallel, increasing the data throughput rate and overcoming the bottleneck of neural network accelerators.
As shown in FIG. 8, an embodiment of this application further provides a method 800 for processing a neural network. Optionally, the method 800 may be performed by the computing apparatus provided in the above embodiments; the descriptions of the technical solutions and technical effects in those embodiments all apply to this embodiment and, for brevity, are not repeated here. As shown in FIG. 8, the method 800 includes the following steps.

810: Perform a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers.

Specifically, step 810 may be performed by the first processing unit 310 in the above embodiments.

820: Perform a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.

Specifically, step 820 may be performed by the second processing unit 320 in the above embodiments.
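Steps 810 and 820 together reproduce an ordinary 2D max-pooling pass, which can be verified against a naive reference implementation. The sketch below is an illustration only; the function names, the 6×8 data, and the 2×2 window are all assumed for the example, with the pooling stride equal to the window size as in the earlier examples.

```python
def method_800_max_pool(matrix, k1, k2):
    """Max pooling via step 810 (first operation over each k1-tall column
    slice) followed by step 820 (second operation over k2 intermediates)."""
    H, W = len(matrix), len(matrix[0])
    out = []
    for i in range(0, H - k1 + 1, k1):
        row_out = []
        for j in range(0, W - k2 + 1, k2):
            # Step 810: first operation on each column slice of the window.
            intermediates = [max(matrix[i + di][j + dj] for di in range(k1))
                             for dj in range(k2)]
            # Step 820: second operation on the k2 intermediate results.
            row_out.append(max(intermediates))
        out.append(row_out)
    return out

def naive_max_pool(matrix, k1, k2):
    """Reference: maximum over each full k1-by-k2 window."""
    H, W = len(matrix), len(matrix[0])
    return [[max(matrix[i + di][j + dj]
                 for di in range(k1) for dj in range(k2))
             for j in range(0, W - k2 + 1, k2)]
            for i in range(0, H - k1 + 1, k1)]

data = [[r * 8 + c for c in range(8)] for r in range(6)]  # 6x8 feature map
pooled = method_800_max_pool(data, 2, 2)  # 3x4 output, matches the reference
```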
In the technical solution provided by this application, decomposing the window operations of the neural network into column operations and row operations allows computation to begin as soon as a single row or column of input data is received. In other words, the input feature matrix can be cached by row or by column while the operations proceed, without first caching a certain amount of two-dimensional input data as in the prior art. The data-processing latency is therefore effectively reduced, the data-processing efficiency of the neural network is improved, and storage resources, and hence hardware resources, are saved.

Optionally, in this embodiment, the method 800 further includes: receiving the input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values. Step 810 then specifically includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain the corresponding intermediate results; specifically, the M first processing units 310 of the above embodiments may each perform the first operation on one of the M groups of data. Step 820 specifically includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain the corresponding calculation result; specifically, this may be done by the M second processing units 320 of the above embodiments.

In the technical solution provided by this application, image data can be processed in parallel, effectively improving data-processing efficiency.
Optionally, in this embodiment, the value of M is related to the size of the input feature matrix and the size of the calculation window.

Optionally, in this embodiment, the M groups of data include all of the data in the column of input feature values.

Optionally, in this embodiment, the M groups of data are part of the column of input feature values, and the method 800 further includes: storing the remaining data of the column, beyond the M groups, into a buffer.

Optionally, in this embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.

Optionally, in this embodiment, the calculation window is a pooling window, the first operation is taking the maximum or the average, and the second operation uses the same arithmetic as the first operation.

Optionally, in this embodiment, the input feature matrix represents one feature map segment of the image to be processed, and receiving the input feature matrix column by column includes: receiving the feature map segments of the image to be processed in sequence.
An embodiment of this application further provides a computer-readable storage medium storing a computer program. When executed by a computer, the computer program implements: performing a first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result, where the size of the calculation window is k1×k2 and k1 and k2 are both positive integers; and performing a second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window, to obtain a calculation result.

The descriptions of the technical solutions and technical effects in the above embodiments all apply to this embodiment and, for brevity, are not repeated here.

Optionally, in this embodiment, the computer program, when executed by a computer, further implements: receiving the input feature matrix column by column, and processing each received column of input feature values according to the calculation window to obtain M groups of data, each group containing k1 input feature values. Performing the first operation on k1 input feature data according to the size of the calculation window to obtain an intermediate result then includes: performing the first operation on each of the M groups of data according to the size of the calculation window to obtain the corresponding intermediate results. Performing the second operation on the k2 intermediate results obtained by the first operation according to the size of the calculation window to obtain a calculation result then includes: for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain the corresponding calculation result.
Optionally, in this embodiment, the value of M depends on the size of the input feature matrix and the size of the calculation window.
Optionally, in this embodiment, the M groups of data include all of the data in the column of input feature values.
Optionally, in this embodiment, the M groups of data are part of the column of input feature values, and the computer program, when executed by the computer, further implements: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
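One plausible reading of this buffering scheme can be sketched as follows. The helper name `split_column`, the stride-1 overlap between groups, and the choice of which values count as "remaining" are our assumptions for illustration, not details taken from the application:

```python
def split_column(column, k1, m, buffer):
    """Take m overlapping groups of k1 values (stride 1 assumed) from
    the head of a received column; stash the column values not covered
    by those m groups in `buffer` for later processing."""
    groups = [column[i:i + k1] for i in range(m)]
    # The m stride-1 windows cover indices 0 .. m + k1 - 2;
    # everything after that is the "remaining data".
    buffer.extend(column[m + k1 - 1:])
    return groups
```

For example, a 10-value column with `k1 = 3` and `m = 4` yields four groups covering the first six values, and the last four values are buffered.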
Optionally, in this embodiment, the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
Optionally, in this embodiment, the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
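The two configurations can be illustrated with a small worked example for a 3×3 window (k1 = k2 = 3). The input values and the kernel column `[1, 0, -1]` are arbitrary illustrations, not values taken from the application:

```python
def mac(values, weights):
    # First operation for a convolution window: multiply-accumulate one
    # column of k1 input values with the matching kernel column.
    return sum(v * w for v, w in zip(values, weights))

# Convolution window: first op = multiply-accumulate, second op = accumulation.
col_intermediates = [mac([1, 2, 3], [1, 0, -1]),
                     mac([4, 5, 6], [1, 0, -1]),
                     mac([7, 8, 9], [1, 0, -1])]
conv_result = sum(col_intermediates)   # second operation: accumulate k2 intermediates

# Pooling window: first and second operations are the same reduction (max here).
pool_result = max(max([1, 2, 3]), max([4, 5, 6]), max([7, 8, 9]))
```

Here each column contributes 1·a + 0·b + (-1)·c = a - c as its intermediate result, and the convolution output is the sum of the three column intermediates, while the pooling output is the maximum over the whole 3×3 window.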
Optionally, in this embodiment, the input feature matrix represents one feature map segment of an image to be processed, and receiving the input feature matrix column by column includes: sequentially receiving the feature map segments of the image to be processed.
The present application is applicable to convolutional neural network (CNN) hardware accelerators, for example in the form of an IP core, and may also be applied to other types of neural network accelerators/processors that include a pooling layer.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The foregoing are merely specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (32)

  1. An arithmetic apparatus for a neural network, comprising:
    a first processing unit, configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    a second processing unit, configured to perform a second operation on k2 intermediate results output by the first processing unit according to the size of the calculation window, to obtain a calculation result.
  2. The arithmetic apparatus according to claim 1, wherein the arithmetic apparatus comprises M first processing units and M second processing units, the M first processing units corresponding one-to-one with the M second processing units, M being a positive integer greater than 1;
    the arithmetic apparatus further comprising:
    a pre-processing unit, configured to receive an input feature matrix column by column and to process the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values, the pre-processing unit being further configured to input the M groups of data one-to-one into the M first processing units.
  3. The arithmetic apparatus according to claim 2, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  4. The arithmetic apparatus according to claim 2 or 3, wherein the M groups of data comprise all of the data in the column of input feature values.
  5. The arithmetic apparatus according to claim 2 or 3, wherein the M groups of data are part of the column of input feature values;
    the pre-processing unit further comprises a buffer; and
    the pre-processing unit is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  6. The arithmetic apparatus according to any one of claims 1 to 5, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  7. The arithmetic apparatus according to any one of claims 1 to 5, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  8. The arithmetic apparatus according to any one of claims 2 to 5, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    the pre-processing unit is specifically configured to sequentially receive the feature map segments of the image to be processed.
  9. A circuit for processing a neural network, comprising:
    a first processing circuit, configured to perform a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    a second processing circuit, configured to perform a second operation on k2 intermediate results output by the first processing circuit according to the size of the calculation window, to obtain a calculation result.
  10. The circuit according to claim 9, wherein the circuit comprises M first processing circuits and M second processing circuits, the M first processing circuits corresponding one-to-one with the M second processing circuits, M being a positive integer greater than 1;
    the circuit further comprising:
    a pre-processing circuit, configured to receive an input feature matrix column by column and to process the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values, the pre-processing circuit being further configured to input the M groups of data one-to-one into the M first processing circuits.
  11. The circuit according to claim 10, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  12. The circuit according to claim 10 or 11, wherein the M groups of data comprise all of the data in the column of input feature values.
  13. The circuit according to claim 10 or 11, wherein the M groups of data are part of the column of input feature values;
    the pre-processing circuit further comprises a buffer and is further configured to store the remaining data of the column of input feature values, other than the M groups of data, in the buffer.
  14. The circuit according to any one of claims 9 to 13, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  15. The circuit according to any one of claims 9 to 13, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  16. The circuit according to claim 10, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    the pre-processing circuit is specifically configured to sequentially receive the feature map segments of the image to be processed.
  17. A method for processing a neural network, comprising:
    performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    performing, according to the size of the calculation window, a second operation on k2 intermediate results obtained from the first operation, to obtain a calculation result.
  18. The method according to claim 17, further comprising:
    receiving an input feature matrix column by column, and processing the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values;
    wherein performing the first operation on the k1 input feature data according to the size of the calculation window to obtain an intermediate result comprises:
    performing the first operation on each of the M groups of data according to the size of the calculation window, to obtain corresponding intermediate results; and
    wherein performing the second operation on the k2 intermediate results obtained from the first operation according to the size of the calculation window to obtain a calculation result comprises:
    for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain a corresponding calculation result.
  19. The method according to claim 18, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  20. The method according to claim 18 or 19, wherein the M groups of data comprise all of the data in the column of input feature values.
  21. The method according to claim 18 or 19, wherein the M groups of data are part of the column of input feature values;
    the method further comprising: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  22. The method according to any one of claims 17 to 21, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  23. The method according to any one of claims 17 to 21, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  24. The method according to any one of claims 17 to 21, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    receiving the input feature matrix column by column comprises: sequentially receiving the feature map segments of the image to be processed.
  25. A computer-readable storage medium having stored thereon a computer program which, when executed by a computer, implements:
    performing a first operation on k1 input feature data according to the size of a calculation window to obtain an intermediate result, the calculation window having a size of k1×k2, where k1 and k2 are both positive integers; and
    performing, according to the size of the calculation window, a second operation on k2 intermediate results obtained from the first operation, to obtain a calculation result.
  26. The computer-readable storage medium according to claim 25, wherein the computer program, when executed by a computer, further implements:
    receiving an input feature matrix column by column, and processing the received column of input feature values according to the calculation window to obtain M groups of data, each group comprising k1 input feature values;
    wherein implementing the first operation on the k1 input feature data according to the size of the calculation window to obtain an intermediate result comprises:
    performing the first operation on each of the M groups of data according to the size of the calculation window, to obtain corresponding intermediate results; and
    wherein implementing the second operation on the k2 intermediate results obtained from the first operation according to the size of the calculation window to obtain a calculation result comprises:
    for the first operation corresponding to each of the M groups of data, performing the second operation each time k2 intermediate results have been obtained, to obtain a corresponding calculation result.
  27. The computer-readable storage medium according to claim 26, wherein the value of M depends on the size of the input feature matrix and the size of the calculation window.
  28. The computer-readable storage medium according to claim 26 or 27, wherein the M groups of data comprise all of the data in the column of input feature values.
  29. The computer-readable storage medium according to claim 26 or 27, wherein the M groups of data are part of the column of input feature values;
    the computer program, when executed by a computer, further implementing: storing the remaining data of the column of input feature values, other than the M groups of data, in a buffer.
  30. The computer-readable storage medium according to any one of claims 25 to 29, wherein the calculation window is a convolution window, the first operation is a multiply-accumulate operation, and the second operation is an accumulation operation.
  31. The computer-readable storage medium according to any one of claims 25 to 29, wherein the calculation window is a pooling window, the first operation takes a maximum value or an average value, and the second operation is the same as the first operation.
  32. The computer-readable storage medium according to any one of claims 26 to 29, wherein the input feature matrix represents one feature map segment of an image to be processed; and
    receiving the input feature matrix column by column comprises: sequentially receiving the feature map segments of the image to be processed.
PCT/CN2017/108640 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network WO2019084788A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201780013527.XA CN108780524A (en) 2017-10-31 2017-10-31 Arithmetic unit, circuit and correlation technique for neural network
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network
US16/727,677 US20200134435A1 (en) 2017-10-31 2019-12-26 Computation apparatus, circuit and relevant method for neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/727,677 Continuation US20200134435A1 (en) 2017-10-31 2019-12-26 Computation apparatus, circuit and relevant method for neural network

Publications (1)

Publication Number Publication Date
WO2019084788A1 true WO2019084788A1 (en) 2019-05-09

Family

ID=64034073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108640 WO2019084788A1 (en) 2017-10-31 2017-10-31 Computation apparatus, circuit and relevant method for neural network

Country Status (3)

Country Link
US (1) US20200134435A1 (en)
CN (1) CN108780524A (en)
WO (1) WO2019084788A1 (en)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3089664A1 (en) * 2018-12-05 2020-06-12 Stmicroelectronics (Rousset) Sas Method and device for reducing the computational load of a microprocessor intended to process data by a convolutional neural network
CN110647978B (en) * 2019-09-05 2020-11-03 北京三快在线科技有限公司 System and method for extracting convolution window in convolution neural network
CN110991609B (en) * 2019-11-27 2023-12-26 天津大学 Line buffer for data transmission
CN110956258B (en) * 2019-12-17 2023-05-16 深圳鲲云信息科技有限公司 Neural network acceleration circuit and method
US11507831B2 (en) * 2020-02-24 2022-11-22 Stmicroelectronics International N.V. Pooling unit for deep learning acceleration
CN113255897B (en) * 2021-06-11 2023-07-07 西安微电子技术研究所 Pooling calculation unit of convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140122551A1 (en) * 2012-10-31 2014-05-01 Mobileye Technologies Limited Arithmetic logic unit
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
CN106875012A (en) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN107239823A (en) * 2016-08-12 2017-10-10 北京深鉴科技有限公司 A kind of apparatus and method for realizing sparse neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6016154A (en) * 1991-07-10 2000-01-18 Fujitsu Limited Image forming apparatus
US9767565B2 (en) * 2015-08-26 2017-09-19 Digitalglobe, Inc. Synthesizing training data for broad area geospatial object detection
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445420A (en) * 2020-04-09 2020-07-24 北京爱芯科技有限公司 Image operation method and device of convolutional neural network and electronic equipment
CN111445420B (en) * 2020-04-09 2023-06-06 北京爱芯科技有限公司 Image operation method and device of convolutional neural network and electronic equipment

Also Published As

Publication number Publication date
CN108780524A (en) 2018-11-09
US20200134435A1 (en) 2020-04-30

Similar Documents

Publication Publication Date Title
WO2019084788A1 (en) Computation apparatus, circuit and relevant method for neural network
US20200285446A1 (en) Arithmetic device for neural network, chip, equipment and related method
US9367892B2 (en) Processing method and apparatus for single-channel convolution layer, and processing method and apparatus for multi-channel convolution layer
US20210073569A1 (en) Pooling device and pooling method
CN111199273B (en) Convolution calculation method, device, equipment and storage medium
US9813502B1 (en) Data transfers in columnar data systems
US11734554B2 (en) Pooling processing method and system applied to convolutional neural network
US20180137407A1 (en) Convolution operation device and convolution operation method
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110737401B (en) Method, apparatus and computer program product for managing redundant array of independent disks
WO2022041188A1 (en) Accelerator for neural network, acceleration method and device, and computer storage medium
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
US9342564B2 (en) Distributed processing apparatus and method for processing large data through hardware acceleration
CN106227506A (en) A kind of multi-channel parallel Compress softwares system and method in memory compression system
CN111178513B (en) Convolution implementation method and device of neural network and terminal equipment
US20210201122A1 (en) Data processing methods, apparatuses, devices, storage media and program products
CN103902614B (en) A kind of data processing method, equipment and system
US11467973B1 (en) Fine-grained access memory controller
WO2023071566A1 (en) Data processing method and apparatus, computer device, computer-readable storage medium, and computer program product
US20160299800A1 (en) System and method for redundant database communication
US8473679B2 (en) System, data structure, and method for collapsing multi-dimensional data
TWI586144B (en) Multiple stream processing for video analytics and encoding
WO2021128820A1 (en) Data processing method, apparatus and device, and storage medium and computer program product
US10841405B1 (en) Data compression of table rows
TWI753728B (en) Architecture and cluster of processing elements and method of convolution operation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17930644

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17930644

Country of ref document: EP

Kind code of ref document: A1