WO2021092941A1 - ROI-pooling layer computation method and device, and neural network system

Info

Publication number: WO2021092941A1
Application number: PCT/CN2019/118933
Authority: WO
Inventors: 谷骞; 高明明; 杨康
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2021-05-20
Also published as: CN112313673A

Abstract

A region of interest (ROI)-pooling layer computation method and device, and a neural network system. The computation device comprises a configuration interface and S computation units. The configuration interface is configured to transmit, to N computation units among the S computation units, configuration information indicating positions of N ROIs, wherein the N ROIs respectively correspond to the N computation units. Each of the N computation units is configured to perform pooling processing on the ROI corresponding thereto so as to acquire output data of the corresponding ROI. The computation device comprises multiple computation units, thereby allowing parallel pooling processing to be performed on multiple ROIs, and improving processing efficiency of a ROI-pooling layer without consuming too much power.

Description

Region of interest-pooling layer calculation method and device, and neural network system

Copyright statement

The content disclosed in this patent document contains copyrighted material. The copyright belongs to the copyright owner. The copyright owner does not object to anyone copying the patent document or the patent disclosure in the official records and archives of the Patent and Trademark Office.

Technical field

This application relates to the field of data processing, and more specifically, to a calculation method and device for a region of interest-pooling layer, and a neural network system.

Background

At present, research in artificial intelligence (AI) is progressing rapidly. In particular, the accuracy of convolutional neural networks (CNNs) in image classification and detection is much higher than that of traditional machine vision algorithms. A CNN is composed of several predefined basic layers, including convolutional layers, activation layers, pooling layers, and fully connected layers, where the pooling layers can include the region of interest (ROI)-pooling layer.

In current technology, the data processing of the region of interest-pooling layer is implemented on a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform. The region of interest-pooling layer is computationally intensive. The computing throughput of the CPU computing platform is not high and cannot meet the computing performance requirements of the region of interest-pooling layer, while the power consumption of the GPU computing platform is too high. Traditional CPU or GPU computing solutions therefore cannot balance computing performance and power consumption.

Therefore, it is necessary to propose a processing solution for the region of interest-pooling layer with low power consumption.

Summary of the invention

The present application provides a calculation method and device for a region of interest-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the region of interest-pooling layer without causing large power consumption.

The first aspect provides a computing device for a region of interest-pooling layer. The computing device includes a configuration interface and S computing units, where S is an integer greater than one. The configuration interface is configured to transmit configuration information indicating the positions of N regions of interest to N computing units of the S computing units, where the N regions of interest correspond to the N computing units one-to-one, and N is a positive integer less than or equal to S. Each of the N computing units is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.

The second aspect provides a calculation method for the region of interest-pooling layer. The calculation method includes: obtaining configuration information indicating the positions of N regions of interest, where N is a positive integer; according to the configuration information, performing parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest.

A third aspect provides a neural network system, which includes the region of interest-pooling layer computing device of the first aspect.

The computing device provided by the present application includes multiple computing units, which can support parallel pooling processing of multiple regions of interest, and therefore, can improve the processing efficiency of the region of interest-pooling layer.

Description of the drawings

Figure 1 is a schematic diagram of the function of the region of interest-pooling layer.

Figure 2 is a schematic diagram of the region of interest-pooling.

Fig. 3 is a schematic block diagram of a computing device according to an embodiment of the present application.

FIG. 4 is a schematic flowchart of obtaining output data of a data window area in an embodiment of the application.

Figure 5 is another schematic diagram of the region of interest-pooling.

Figure 6 is another schematic diagram of the region of interest-pooling.

Figure 7 is another schematic diagram of the region of interest-pooling.

FIG. 8 is a schematic block diagram of a computing unit according to an embodiment of the application.

FIG. 9 is another schematic block diagram of a computing unit according to an embodiment of the application.

Fig. 10 is another schematic block diagram of a computing device according to an embodiment of the present application.

Fig. 11 is a schematic flowchart of a calculation method for a region of interest-pooling layer according to an embodiment of the present application.

Fig. 12 is a schematic block diagram of a neural network system according to an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described below in conjunction with the accompanying drawings.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terminology used in the specification of the application herein is only for the purpose of describing specific embodiments, and is not intended to limit the application.

In order to better understand the embodiments of the present application, the following first introduces the related concepts of the region of interest-pooling layer (hereinafter referred to as the ROI-pooling layer).

As shown in Figure 1, the function of the ROI-pooling layer is to down-sample the region of interest (ROI) in the feature map.

The input data (input feature map, IFM) of the ROI-pooling layer is the output of the previous layer. The input data of the ROI-pooling layer can be an array composed of a feature map, or a 3D array composed of multiple feature maps. As shown in Figure 1, the input data of the ROI-pooling layer is L feature maps, and the resolution of each feature map is H (height)×W (width).

The output data (output feature map, OFM) of the ROI-pooling layer is composed of several cubes. As shown on the right side of Figure 1, there are a total of M cubes. The number of cubes output by the ROI-pooling layer is determined by the number of regions of interest (ROIs) in the input feature map.

The dimensions of each cube are the same. For example, in the example in Figure 1, each cube is composed of L output feature maps. Among them, the resolution of each output feature map is the same. For example, in the example of FIG. 1, the resolution of each output feature map in the cube is E (height)×F (width).

The function of the ROI-pooling layer is to downsample the region of interest in the input feature map. For example, in the example in Figure 1, taking one of the L feature maps as an example, the ROI-pooling layer downsamples the region of interest in the input feature map, whose resolution is H×W, into an output feature map with a resolution of E×F.

The resolution of the feature map output by the ROI-pooling layer may be predefined. For example, in the example of FIG. 1, the resolution E×F of the output cube may be predefined.

The mapping method of the ROI-pooling layer pooling processing can also be defined in advance, and generally there are two types: maximum (max) or average (avg).

It can be understood that the feature of the ROI-pooling layer is that the size of the region of interest to be pooled may not be fixed, and the size of the output feature map corresponding to each region of interest is fixed.

As an example and not a limitation, in Figure 1, the calculation process of the ROI-pooling layer is as follows: according to the resolution E×F of the output cube and the position of the region of interest in the input feature map, infer, point by point, the data window area on the input feature map that corresponds to each output data; then perform arithmetic processing on the data in the data window area to obtain the output data corresponding to that data window area. The arithmetic processing here can be taking the maximum value or the average value.
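As an illustration and not a limitation, the calculation process described above can be sketched in Python for a single feature map. The function name `roi_pool` and its parameters are hypothetical, and the floor/ceiling rule used to derive each data window area is an assumption chosen for the sketch; the source does not fix a particular partition rule.

```python
def roi_pool(feature_map, top, left, roi_h, roi_w, out_h, out_w, mode="max"):
    """Pool one ROI of a 2D feature map (list of rows) to out_h x out_w.

    (top, left) is the 0-based upper-left pixel of the ROI and
    (roi_h, roi_w) is its size. The window bounds below are an assumed
    floor/ceiling split, not a rule taken from the source.
    """
    output = []
    for i in range(out_h):
        # Row range of the data window for output row i (end exclusive).
        r0 = top + (i * roi_h) // out_h
        r1 = top + ((i + 1) * roi_h + out_h - 1) // out_h
        out_row = []
        for j in range(out_w):
            # Column range of the data window for output column j.
            c0 = left + (j * roi_w) // out_w
            c1 = left + ((j + 1) * roi_w + out_w - 1) // out_w
            window = [feature_map[r][c]
                      for r in range(r0, r1)
                      for c in range(c0, c1)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(roi_pool(fm, 0, 0, 4, 4, 2, 2, mode="max"))  # [[6, 8], [14, 16]]
```

Whatever the size of the region of interest, the result always has out_h × out_w entries, which matches the fixed output resolution described above.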

For ease of understanding and description, rather than limitation, the concepts and terminology involved in this application are described below.

1. Region of interest

The region of interest represents the region on the input feature map that is to be pooled (i.e., down-sampled).

2. The resolution of the pooled output frame

The resolution of the pooled output frame indicates the resolution of the feature map obtained after the region of interest is pooled. For example, the resolution of the pooled output frame is the resolution E×F of the output cube shown in FIG. 1.

The pooled output frame can also be referred to as the feature map obtained after the region of interest is pooled.

In this article, the pixel points in the feature map obtained after the region of interest undergoes pooling processing are called the output data. It should be understood that, assuming that the resolution of the pooled output frame is E×F, then one region of interest corresponds to (E*F) pieces of output data.

3. Data window area

The data window area represents the sub-area of a region of interest that corresponds to one piece of output data of that region of interest.

Taking an output data corresponding to a certain region of interest as an example, the output data is obtained by pooling data in a certain sub-region in the region of interest. A certain sub-area here can be recorded as the data window area.

The above concepts are introduced below in conjunction with the example in Figure 2.

In Figure 2, the resolution of the pooled output frame is 2×2, which means that the feature map obtained after pooling any region of interest in the input feature map has a resolution of 2×2, or in other words, any region of interest in the input feature map corresponds to 4 output data.

In Fig. 2, a region of interest bounded by rows 4 to 8 and columns 1 to 7 of the input feature map is shown. Assume that the four output data corresponding to the region of interest are C1, C2, C3, and C4, where the data window area corresponding to C1 on the region of interest is A1, the data window area corresponding to C2 is A2, the data window area corresponding to C3 is A3, and the data window area corresponding to C4 is A4. In other words, C1 is obtained by performing arithmetic processing (maximum value or average value) on the data in the data window area A1, and C2, C3, and C4 are obtained by analogy.

It should be understood that FIG. 2 is only an example and not a limitation. In practical applications, one input feature map may include multiple regions of interest.

In traditional technology, a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform is used to process the calculation of the ROI-pooling layer. The computing efficiency of the CPU computing platform cannot meet the computing performance requirements of the ROI-pooling layer, and the GPU computing platform consumes a lot of power.

This application proposes a calculation method and device for the ROI-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the ROI-pooling layer without causing large power consumption.

FIG. 3 is a schematic block diagram of a computing device 300 of the ROI-pooling layer provided by this application.

As shown in FIG. 3, the computing device 300 includes a plurality of computing units 320, and each computing unit 320 has a function of performing pooling processing on a region of interest. It should be understood that through the multiple computing units 320, parallel pooling processing of multiple regions of interest can be implemented. In other words, the computing device provided by the present application can implement parallel data processing of the ROI-pooling layer, thereby improving the processing efficiency of the ROI-pooling layer.

As shown in FIG. 3, the computing device 300 further includes a configuration interface 310 configured to transmit configuration information indicating the positions of the N regions of interest to the N computing units 320, where the N regions of interest correspond to the N computing units 320 one-to-one.

The configuration information may indicate the positions of the N regions of interest on the input feature map.

For example, the configuration information may indicate the coordinates of the pixel points in the N regions of interest.

Optionally, the position of the region of interest includes the coordinates of the first pixel at the upper left corner of the region of interest and the size of the region of interest.

Optionally, the position of the region of interest includes the coordinates of all pixels in the region of interest.
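As an illustration only, the two optional ways of expressing the position of a region of interest can be sketched as follows. The class `RoiConfig` and its field names are hypothetical and do not reflect any register layout of the configuration interface 310; the values used are those of the region of interest in Fig. 2 (rows 4 to 8, columns 1 to 7, numbered from 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoiConfig:
    """Hypothetical ROI descriptor: upper-left pixel plus size."""
    top: int      # row of the first pixel at the upper left corner
    left: int     # column of the first pixel at the upper left corner
    height: int   # number of rows in the ROI
    width: int    # number of columns in the ROI

    def pixel_coords(self):
        """Expand to the alternative form: coordinates of all pixels."""
        return [(r, c)
                for r in range(self.top, self.top + self.height)
                for c in range(self.left, self.left + self.width)]


# ROI of Fig. 2: rows 4 to 8 and columns 1 to 7.
roi = RoiConfig(top=4, left=1, height=5, width=7)
coords = roi.pixel_coords()
print(len(coords), coords[0], coords[-1])  # 35 (4, 1) (8, 7)
```

The corner-plus-size form needs four numbers per region of interest, whereas the all-pixels form grows with the area of the region, which is one reason a compact descriptor may be preferable for a configuration interface.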

The configuration interface 310 may be an advanced peripheral bus (APB) interface.

Each of the N computing units 320 is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.

For example, each of the N computing units 320 is configured to determine the corresponding region of interest according to the received configuration information, and to perform pooling processing on the data in the corresponding region of interest to obtain the output data of the region of interest.

Optionally, N computing units 320 represent all computing units of the computing device 300, that is, all computing units 320 of the computing device 300 participate in the pooling process.

Optionally, the N computing units 320 represent part of the computing units of the computing device 300, that is, part of the computing units 320 of the computing device 300 participate in the pooling process.

For example, if the total number of computing units included in the computing device 300 is S, then N is a positive integer less than or equal to S. For example, S is an integer greater than 1.

It should be understood that, in actual applications, it may be determined whether part or all of the computing units in the computing device 300 participate in operations according to application requirements.

Optionally, N is an integer greater than 1.

It should be understood that in this embodiment, pooling processing of more than one region of interest can be performed by more than one calculation unit 320, that is, parallel data processing of the ROI-pooling layer can be realized, so that the processing efficiency of the ROI-pooling layer can be improved.

It should be understood that N can also be equal to 1. For example, if only one region of interest needs to be pooled in the current computing task, the value of N is set to 1.

It should also be understood that regardless of the value of N, the computing device provided in the present application can support parallel pooling processing of multiple regions of interest, and therefore, can improve the processing efficiency of the ROI-pooling layer.
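The one-to-one assignment of N regions of interest to N of the S computing units can be mimicked in software as a rough sketch. Here worker threads stand in for the computing units 320 purely for illustration; this is not a model of the hardware, the names `run_parallel` and `max_pool_whole_roi` are hypothetical, and each region of interest is reduced to a single maximum value for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def max_pool_whole_roi(feature_map, roi):
    """Stand-in for one computing unit: max over a whole ROI (1x1 output)."""
    top, left, height, width = roi
    return max(feature_map[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))


def run_parallel(feature_map, rois, num_units):
    """Assign N ROIs to N of the S units, one ROI per unit."""
    if len(rois) > num_units:
        raise ValueError("N must be less than or equal to S")
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        futures = [pool.submit(max_pool_whole_roi, feature_map, roi)
                   for roi in rois]
        # Results are collected in ROI order, one result per unit.
        return [f.result() for f in futures]


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
rois = [(0, 0, 2, 2), (2, 2, 2, 2)]       # N = 2 ROIs as (top, left, h, w)
print(run_parallel(fm, rois, num_units=4))  # S = 4 units -> [6, 16]
```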

It should be understood that the function of each calculation unit 320 in the N calculation units 320 is the same, that is, the method for each calculation unit 320 to pool the region of interest is similar. In order to facilitate understanding and description, the function and operation of the calculation unit 320 are described by taking the first calculation unit 320 of the N calculation units 320 as an example. It should be understood that the description of the first calculation unit 320 herein may be adaptively applied to each calculation unit 320 of the N calculation units 320.

The first calculation unit 320 is configured to perform pooling processing on the first region of interest to obtain output data of the first region of interest. The first region of interest represents the region of interest corresponding to the first calculation unit 320 among the N regions of interest. The method by which the first calculation unit 320 performs pooling processing on the first region of interest to obtain the output data of the first region of interest includes the following steps S410 to S440, as shown in FIG. 4.

S410: Obtain data of an input feature map, where the input feature map includes K regions of interest, and K is a positive integer not less than N.

Continuing to refer to FIG. 3, the computing device 300 further includes a data input interface 330 configured to read data of the input feature map from an external storage device. The first calculation unit 320 may obtain data of the input feature map from the data input interface 330. For example, the data input interface 330 is configured to send data of the input feature map to the first calculation unit 320.

S420: According to the position of the first region of interest and the resolution of the pooled output frame, obtain a data window area corresponding to the data to be output of the first region of interest on the first region of interest.

S430: Select data that falls into the data window area from the acquired data of the input feature map.

S440: Perform arithmetic processing on the data falling in the data window area to obtain the output data of the data window area. The arithmetic processing can be taking the maximum value or the average value, and the specific processing method can be predefined.

It should be understood that when the first calculation unit 320 obtains the output data of all the data window regions of the first region of interest, it also obtains the output data of the first region of interest. Assuming that the resolution of the pooled output frame is E×F, the first region of interest corresponds to (E*F) pieces of output data.
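Steps S430 and S440 can be sketched, under stated assumptions, as a filter over a stream of input feature map data. The sketch assumes that the data arrives as (row, column, value) triples and that the data window area from step S420 is already known; the names `stream` and `pool_window_streaming` are hypothetical.

```python
def stream(feature_map):
    """Assumed input format: yield (row, col, value) in raster order."""
    for r, row in enumerate(feature_map):
        for c, value in enumerate(row):
            yield r, c, value


def pool_window_streaming(pixel_stream, window, mode="max"):
    """Pool one data window from streamed data.

    window = (row_start, row_end, col_start, col_end), ends exclusive.
    """
    r0, r1, c0, c1 = window
    acc = None
    count = 0
    for r, c, value in pixel_stream:
        # S430: keep only the data that falls into the data window area.
        if not (r0 <= r < r1 and c0 <= c < c1):
            continue
        # S440: running maximum, or running sum for the average.
        if acc is None:
            acc = value
        elif mode == "max":
            acc = max(acc, value)
        else:
            acc += value
        count += 1
    if count == 0:
        raise ValueError("no data fell into the data window area")
    return acc if mode == "max" else acc / count


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(pool_window_streaming(stream(fm), (0, 2, 0, 2), mode="max"))  # 6
print(pool_window_streaming(stream(fm), (0, 2, 0, 2), mode="avg"))  # 3.5
```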

The computing device provided in the present application can implement parallel data processing of the ROI-pooling layer, and compared with the prior art, can effectively improve the processing efficiency of the ROI-pooling layer under the premise of lower power consumption.

It should be understood that the method of dividing a region of interest into multiple data window areas based on the resolution of the pooled output frame is known in the prior art and will not be described in detail herein.

Step S440 can be implemented in multiple implementation manners.

Optionally, in step S440, performing arithmetic processing on the data falling in the data window area to obtain output data of the data window area includes: obtaining a column processing result of each row of data falling in the data window area; and performing row processing on the column processing results to obtain the output data of the data window area.

It should be understood that if the operation mode (also referred to as the mapping relationship) of the pooling process is to find the maximum value, the operation mode corresponding to the column processing is the maximum value, and the operation mode corresponding to the row processing is also the maximum value. If the operation mode (also called the mapping relationship) of the pooling process is averaging, the operation mode corresponding to the column processing is the cumulative sum, and the operation mode corresponding to the row processing is the cumulative sum first, and then the average value.
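The split into column processing and row processing can be sketched as follows for both operation modes. The function names are hypothetical; the sketch only illustrates that reducing each row first and then reducing the per-row results gives the same output as pooling the whole data window area at once.

```python
def column_process(row, mode):
    """Column processing of one row: maximum, or cumulative sum for average."""
    return max(row) if mode == "max" else sum(row)


def row_process(column_results, mode, element_count):
    """Row processing: maximum, or cumulative sum followed by the average."""
    if mode == "max":
        return max(column_results)
    return sum(column_results) / element_count


def pool_window(rows, mode="max"):
    """Pool one data window area given as a list of rows."""
    column_results = [column_process(row, mode) for row in rows]
    element_count = sum(len(row) for row in rows)
    return row_process(column_results, mode, element_count)


rows = [[1, 2], [3, 4]]
print(pool_window(rows, "max"))  # 4
print(pool_window(rows, "avg"))  # 2.5
```

Because only one value per row has to be kept between the two stages, the intermediate storage grows with the number of rows in the window rather than with the number of pixels.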

Hereinafter, in conjunction with FIG. 5, an example in which the first calculation unit 320 performs pooling processing on the region of interest is described. In FIG. 5, the calculation method of the pooling process (also called the mapping relationship) is to find the maximum value, the resolution of the pooled output frame is 2×2, and the region of interest has a resolution of 8×4.

In the example of FIG. 5, the pooling processing of the region of interest by the first calculation unit 320 includes the following steps 1-1 and 1-2.

Step 1-1, based on the 2×2 resolution of the pooled output frame, determine 4 data window areas in the region of interest: as shown in Figure 5, data window area 1, data window area 2, data window area 3, and data window area 4.

It should be understood that this step 1-1 corresponds to step S420 and step S430 in FIG. 4.

Step 1-2, perform maximum value processing on the data in the 4 data window areas respectively to obtain corresponding output data.

As shown in Figure 5, the data in data window area 1 is subjected to maximum value processing to obtain output data 29; the data in data window area 2 is subjected to maximum value processing to obtain output data 31; the data in data window area 3 is subjected to maximum value processing to obtain output data 30; and the data in data window area 4 is subjected to maximum value processing to obtain output data 28.

Taking the data window area 1 as an example, performing maximum value processing on the data in the data window area 1 in step 1-2 to obtain the output data 29 of the data window area 1 may include the following sub-steps.

Sub-step 1-2-1, obtain the column processing result of each row of data that falls into the data window area 1.

For example, take the maximum value of the first row of data {12,3} to obtain the column processing result 12 of the first row of data; take the maximum value of the second row of data {29,26} to obtain the column processing result 29 of the second row of data; take the maximum value of the third row of data {2,11} to obtain the column processing result 11 of the third row of data; and take the maximum value of the fourth row of data {12,13} to obtain the column processing result 13 of the fourth row of data.

In sub-step 1-2-2, row processing is performed on the column processing result of each row of data in the data window area 1, and the output data 29 of the data window area 1 is obtained.

For example, the maximum value of the column processing results {12, 29, 11, 13} of the 4 rows of data in the data window area 1 is obtained, and the row processing result is 29, that is, the output data 29 of the data window area 1 is obtained.
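Sub-steps 1-2-1 and 1-2-2 can be traced with the values of data window area 1 quoted above. The variable names are illustrative only.

```python
# Data of data window area 1 in FIG. 5, one inner list per row.
window_1 = [[12, 3], [29, 26], [2, 11], [12, 13]]

# Sub-step 1-2-1: column processing, one maximum per row.
column_results = [max(row) for row in window_1]

# Sub-step 1-2-2: row processing over the column processing results.
output = max(column_results)

print(column_results)  # [12, 29, 11, 13]
print(output)          # 29
```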

It should be understood that step 1-2 corresponds to step S440 in FIG. 4.

It should also be understood that in the example shown in FIG. 5, a feature map with a resolution of 8×4 is input, and a feature map with a resolution of 2×2 is output.

In the example of FIG. 5, there is no overlapping area between different data window areas. In some cases, there may be overlapping areas between different data window areas, as shown in Figure 6.

In FIG. 6, the calculation method of the pooling process (also referred to as the mapping relationship) is to find the maximum value, the resolution of the pooled output frame is 2×2, and the region of interest has a resolution of 6×4. It can be seen from Figure 6 that there is an overlap area between the data window area 1 and the data window area 3, which correspond to the output data 29 and 22, respectively, and there is an overlap area between the data window area 2 and the data window area 4, which correspond to the output data 31 and 23, respectively.

It should be understood that FIG. 6 is only an example and not a limitation. In practical applications, the overlapping area between the two data window areas may include one or more rows of data, or may include one or more columns of data.

If the overlap area between the two data window areas includes one or more rows of data, the overlap area may also be referred to as a row overlap area. If the overlapping area between two data window areas includes one or more columns of data, the overlapping area may also be referred to as a column overlapping area.

Optionally, in step S440, performing arithmetic processing on the data falling in the data window area to obtain output data of the data window area includes: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, then in the process of obtaining the output data of the first data window area, caching the first column processing result of the row overlap area; and in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.

For example, the first region of interest is the region of interest shown in FIG. 6, the first data window area is the data window area 1 shown in FIG. 6, and the second data window area is the data window area 3 shown in FIG. 6. The data window area 1 and the data window area 3 have a row overlap area, and the row overlap area includes two rows of data {{2, 11}, {12, 13}}.

In the example of FIG. 6, the process of obtaining the output data 29 falling in the data window area 1 by the first calculation unit 320 may include the following steps 2-1 to 2-3. The process of obtaining the output data 22 falling into the data window area 3 by the first calculation unit 320 may include the following steps 3-1 and 3-2.

Step 2-1: Perform column processing on each row of data that falls into the data window area 1, and obtain the column processing result of each row of data. In this example, the calculation method of pooling processing is to find the maximum value. Accordingly, the calculation methods of column processing and row processing are both to find the maximum value.

Referring to Figure 6, the maximum value of the first row of data {12,3} is taken to obtain the column processing result 12 of the first row of data; the maximum value of the second row of data {29,26} is taken to obtain the column processing result 29 of the second row of data; the maximum value of the third row of data {2,11} is taken to obtain the column processing result 11 of the third row of data; and the maximum value of the fourth row of data {12,13} is taken to obtain the column processing result 13 of the fourth row of data.

In step 2-2, row processing is performed on the column processing result obtained in step 2-1 to obtain the row processing result, that is, the output data of the data window area 1 is obtained.

Referring to FIG. 6, the column processing result {12, 29, 11, 13} obtained in step 2-1 is maximized, and the row processing result 29 is obtained, that is, the output data 29 of the data window area 1 is obtained.

Step 2-3: Cache the column processing results {11,13} of the row overlap area (that is, row 3 and row 4).

Optionally, step 2-3 may be included in step 2-1 or step 2-2.

Step 3-1: Perform column processing on the row data falling in the data window area 3 except for the row overlap area.

Referring to Figure 6, the maximum value of the third row of data {7,14} is taken to obtain the column processing result 14 of the third row of data; and the maximum value of the fourth row of data {22,4} is taken to obtain the column processing result 22 of the fourth row of data.

Step 3-2, perform row processing on the column processing result {14,22} obtained in step 3-1 and the column processing result {11,13} of the row overlap area cached in step 2-3, to obtain the row processing result, namely the output data of the data window area 3.

That is, the maximum value of the column processing result {14, 22} obtained in step 3-1 and the column processing result {11, 13} of the row overlap region cached in step 2-3 is obtained, and the output data 22 of the data window area 3 is obtained.

In the above example in conjunction with FIG. 6, in the process of obtaining the output data of the data window area 3, the first calculation unit 320 omits the read operation for the first row of data {2,11} and the second row of data {12,13} in the data window area 3, and directly uses the column processing results of these two rows of data cached in the process of obtaining the output data of the data window area 1. From the above step 2-1, it can be seen that in the process of obtaining the output data of the data window area 1, the read operation for the two rows of data {2,11} and {12,13} in the row overlap area has already been performed. Therefore, in this embodiment, when different data window areas overlap, repeated reading of data can be avoided.

It should be understood that this embodiment can avoid repeated reading of data, thereby saving bandwidth, and in addition, can also improve calculation efficiency.

It should be understood that the above step 2-1 to step 2-3, and step 3-1 to step 3-2 may be optional implementations of step S440.
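Steps 2-1 to 2-3 and steps 3-1 to 3-2 can be sketched with the FIG. 6 values quoted above. The function name `pool_with_column_cache` and its parameters are hypothetical, and the sketch covers only the maximum mode used in this example.

```python
def pool_with_column_cache(window_1_rows, window_3_new_rows, overlap_rows):
    """Max-pool two row-overlapping windows, reusing cached column results.

    window_1_rows: all rows of the first data window area.
    window_3_new_rows: rows of the second window outside the overlap.
    overlap_rows: number of trailing rows of the first window that are shared.
    """
    # Step 2-1: column processing of every row of the first window.
    col_1 = [max(row) for row in window_1_rows]
    # Step 2-2: row processing gives the output of the first window.
    out_1 = max(col_1)
    # Step 2-3: cache the column results of the row overlap area.
    cache = col_1[len(col_1) - overlap_rows:]
    # Step 3-1: column processing of only the non-overlapping rows.
    col_3 = [max(row) for row in window_3_new_rows]
    # Step 3-2: row processing over cached and new column results.
    out_3 = max(cache + col_3)
    return out_1, out_3, cache


window_1 = [[12, 3], [29, 26], [2, 11], [12, 13]]
window_3_new = [[7, 14], [22, 4]]
print(pool_with_column_cache(window_1, window_3_new, overlap_rows=2))
# (29, 22, [11, 13])
```

The rows {2,11} and {12,13} are read once, while the first window is processed, and only their two column processing results are kept for the second window.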

Optionally, in step S440, performing arithmetic processing on the data falling in the data window area to obtain output data of the data window area includes: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, then in the process of obtaining the output data of the first data window area, caching the row processing result of the first column processing result of the row overlap area; and in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.

As an example, see Figure 6. The first region of interest is the region of interest shown in FIG. 6, the first data window area is the data window area 1 shown in FIG. 6, and the second data window area is the data window area 3 shown in FIG. 6. The data window area 1 and the data window area 3 have a row overlap area, and the row overlap area includes two rows of data {{2, 11}, {12, 13}}.

The process in which the first calculation unit 320 obtains the output data 29 falling in the data window area 1 may include steps 4-1 to 4-3. The process of obtaining the output data 22 falling into the data window area 3 by the first calculation unit 320 may include steps 5-1 and 5-2.

Among them, step 4-1 is the same as step 2-1 described above, and step 4-2 is the same as step 2-2 described above.

Step 4-3: Cache the row processing result {13} of the column processing result {11, 13} of the row overlap area (i.e., the 3rd row and the 4th row).

Optionally, step 4-3 may be included in step 4-2.

Step 5-1 is the same as step 3-1 described above.

Step 5-2: Perform row processing on the column processing result {14,22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3 to obtain the row processing result, that is, the output data of the data window area 3.

That is, the maximum of the column processing result {14, 22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3 is taken, giving the output data 22 of the data window area 3.

In this embodiment, in the process of obtaining the output data of the data window area 3, the first calculation unit 320 omits the read operations for the first row of data {2,11} and the second row of data {12,13} of the data window area 3, and directly uses the row processing result of these two rows that was cached in the process of obtaining the output data of the data window area 1.

As can be seen from step 4-1 above, the two rows of data {2,11} and {12,13} in the row overlap area were already read in the process of obtaining the output data of the data window area 1. Therefore, in this embodiment, when different data window areas have an overlapping area, repeated reading of data can be avoided.

In addition, in step 4-3 of this embodiment, only the row processing result {13} in the overlapping area (ie, the third row and the fourth row) needs to be cached. Relatively speaking, the cache requirement of the first computing unit 320 can be further reduced.

It should be understood that the foregoing step 4-1 to step 4-3, and step 5-1 to step 5-2 may be optional implementations of step S440.
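The two caching variants described above can be compared in a short sketch. It is illustrative only: the overlap rows {2,11} and {12,13} and the results 29, 22, {11,13}, {14,22} and {13} come from the text, while the non-overlapping rows and the function names are hypothetical, chosen so that the results match, and are not taken from FIG. 6.

```python
# Rows shared by data window area 1 and data window area 3 (from the text).
overlap_rows = [[2, 11], [12, 13]]
# Hypothetical non-overlapping rows, chosen only to reproduce the stated results.
window1_only_rows = [[29, 5]]
window3_only_rows = [[14, 3], [7, 22]]


def column_processing(rows):
    """Reduce each row to one value (here: the row maximum)."""
    return [max(row) for row in rows]


def row_processing(values):
    """Reduce the per-row values to a single value (here: the maximum)."""
    return max(values)


# Window 1: every row is read once, including the overlap rows.
overlap_col = column_processing(overlap_rows)  # [11, 13]
out1 = row_processing(column_processing(window1_only_rows) + overlap_col)

# Variant A: cache the column processing results of the overlap rows.
cache_a = overlap_col
# Variant B: cache only their row processing result (a single value).
cache_b = row_processing(overlap_col)  # 13

# Window 3: only the non-overlapping rows are read again.
new_col = column_processing(window3_only_rows)  # [14, 22]
out3_a = row_processing(new_col + cache_a)
out3_b = row_processing(new_col + [cache_b])

print(out1, out3_a, out3_b)  # 29 22 22
```

Both variants give the same output; variant B stores one value instead of one value per overlap row, which is the cache saving the text describes.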

As another example, see FIG. 7. As shown in FIG. 7, the left side is a column of data in the region of interest, and the right side is a column of output data in the pooled output box (also referred to as an output feature map) corresponding to the region of interest. The pixel a is calculated from the input data {1,2,3,4,5}, and the pixel b is calculated from the input data {3,4,5,6,7}. When the input data is processed row by row, the calculation result of {3,4,5} is obtained while the pixel a is being calculated. This calculation result is an intermediate result of both the pixel a and the pixel b. In this case, the calculation result of {3,4,5} is cached in the process of calculating the pixel a, and when the pixel b is calculated, the cached calculation result of {3,4,5} can be read directly, so that the pixel b is calculated from the calculation result of {3,4,5} and the data {6,7}.

It should be understood that this embodiment can save bandwidth and improve calculation efficiency by avoiding repeated reading of data, and in addition, can also reduce cache requirements.
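The FIG. 7 case can be sketched as follows, treating {1,...,7} as row positions in one column and assuming max pooling. The sample values and the function name are hypothetical; the sketch only shows that the shared partial result for positions {3,4,5} is computed once and reused.

```python
def pool_column_with_cache(values):
    """values holds the data at positions 1..7 of one column.

    Returns (pixel_a, pixel_b, number_of_input_reads).
    """
    reads = 0
    head = max(values[0:2])    # positions 1-2, used only by pixel a
    reads += 2
    shared = max(values[2:5])  # positions 3-5, computed once and cached
    reads += 3
    pixel_a = max(head, shared)
    tail = max(values[5:7])    # positions 6-7, used only by pixel b
    reads += 2
    pixel_b = max(shared, tail)  # reuses the cached partial result
    return pixel_a, pixel_b, reads


a, b, reads = pool_column_with_cache([9, 4, 7, 1, 8, 3, 6])
print(a, b, reads)  # 9 8 7
```

Without the cache, the two windows would require 5 + 5 = 10 input reads instead of 7.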

As described above, in step S440, the data falling in the data window area is processed by column processing first and then row processing to obtain the output data of the data window area. This application is not limited to this. For example, in step S440, it is also possible to process the data falling in the data window area by row processing first and then column processing to obtain the output data of the data window area.

Optionally, step S440 includes: obtaining a row processing result of each column of data falling in the data window area; performing column processing on the obtained row processing result to obtain output data in the data window area.

It should be understood that in the embodiment in which the data falling in the data window area is processed by row processing first and then column processing, repeated reading of data can likewise be avoided when different data window areas have an overlapping area. The implementation manner is similar to the related description in the above embodiment, and will not be repeated here.
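As a minimal sketch of why either order can be used, the following compares the two orders for max pooling on one data window. The window contents and function names are hypothetical.

```python
def column_then_row(window):
    """Reduce each row first, then reduce the per-row results."""
    per_row = [max(row) for row in window]
    return max(per_row)


def row_then_column(window):
    """Reduce each column first, then reduce the per-column results."""
    per_column = [max(col) for col in zip(*window)]
    return max(per_column)


# Hypothetical 4x2 data window.
window = [[2, 11], [12, 13], [5, 14], [22, 7]]
print(column_then_row(window), row_then_column(window))  # 22 22
```

The same holds for accumulation in average pooling, since the sum over a window does not depend on the order in which rows and columns are reduced.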

Optionally, as shown in FIG. 8, the first calculation unit 320 may include one or more arithmetic modules 324, and the arithmetic modules 324 are configured to perform arithmetic processing on the data falling in the data window area to obtain the output data of the data window area.

For example, step S440 in the above embodiment is executed by the arithmetic module 324 in the first calculation unit 320.

Optionally, the first calculation unit 320 includes a plurality of arithmetic modules 324, wherein each arithmetic module 324 is configured to obtain one output data.

Optionally, the number of arithmetic modules 324 included in the first calculation unit 320 may be related to the width of the pooled output box.

For example, the resolution of the pooled output frame is E (height)×F (width), and the number of arithmetic modules 324 in the first calculation unit 320 may be F.

It should be understood that the number of arithmetic modules 324 in the first calculation unit 320 may also be determined according to actual needs, which is not limited in this application. For example, in practical applications, factors such as application performance requirements and resource occupation can be comprehensively considered to determine the number of arithmetic modules 324 in the first calculation unit 320. The occupation of resources includes any one or more of the following: occupation of storage space (memory), and volume of the device.

Optionally, the arithmetic modules 324 of the first calculation unit 320 are tailorable. In other words, the number of arithmetic modules 324 in the first calculation unit 320 may be dynamically changed.

Therefore, since the arithmetic modules 324 of the first calculation unit 320 provided in the present application are tailorable, the number of arithmetic modules 324 in the first calculation unit 320 can be flexibly adjusted based on calculation requirements, thereby improving calculation efficiency and the utilization rate of computing resources.

The first calculation unit 320 may also include a calculation control module configured to perform step S420 and step S430 in the above embodiment.

Optionally, the calculation control module and the arithmetic module 324 can be provided separately or integrated. For example, the arithmetic module 324 includes the calculation control module.

Optionally, as shown in FIG. 8, the first calculation unit 320 may further include a sub-configuration interface 321, a sub-data interface 322, and a storage module 323.

The sub-configuration interface 321 is configured to receive configuration information from the configuration interface 310.

For example, both the configuration interface 310 and the sub-configuration interface 321 may be an advanced peripheral bus (APB) interface.

The sub-data interface 322 is configured to receive data of the input feature map from the data input interface 330.

The storage module 323 is configured to cache intermediate processing results.

It should be understood that the storage module 323 may be configured to buffer intermediate processing data that needs to be reused in the process of obtaining the output data of the first region of interest by the first calculation unit 320.

As an example, in the above-described embodiment in which different data window areas have overlapping areas, the storage module 323 may be configured to cache the column processing result or the row processing result of the data in the row overlap area of the data window area 1 and the data window area 3.

Optionally, the storage module 323 may be located in the arithmetic module 324.

Optionally, the storage module 323 can also be configured to store the data of the input feature map received by the sub-data interface 322.

For example, the data of the input feature map stored in the storage module 323 can support the first calculation unit 320 to perform pooling processing of multiple regions of interest. In other words, during the pooling process of one or more regions of interest, the first calculation unit 320 can directly obtain the data of the input feature map from the storage module 323 without obtaining it from the outside.

FIG. 9 is another schematic block diagram of the first calculation unit 320. The first calculation unit 320 includes a sub-configuration interface 321, a sub-data interface 322, a storage module 323 and an arithmetic module 324. Among them, the arithmetic module 324 includes the control circuit part shown in the left half of the box marked 324 in FIG. 9 and the arithmetic circuit part shown in the right half of the box.

The sub-configuration interface 321 is configured to receive configuration information sent by the configuration interface 310, the configuration information indicating the location of the region of interest.

The sub-data interface 322 is configured to receive the data of the input feature map sent by the data input interface 330.

The storage module 323 is configured to buffer the data received by the sub-data interface 322, and can also be used to buffer intermediate processing results.

The control circuit part in the arithmetic module 324 is configured to execute step S420 and step S430 described in the above embodiment.

For example, the control circuit part in the arithmetic module 324 is configured to calculate the start coordinates (w_start_floor) and the end coordinates (w_end_ceil).
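As a rough sketch of what such a coordinate calculation might look like, the following uses the bin formula common in ROI-pooling implementations (floor for the start, ceiling for the end). This formula is an assumption and is not taken from steps S420 and S430; only the names w_start_floor and w_end_ceil come from the text, and it is not claimed to reproduce FIG. 6.

```python
def window_bounds(roi_start, roi_len, out_len, index):
    """Return (start_floor, end_ceil) of output bin `index` along one axis.

    Assumed formula: start = floor(index * roi_len / out_len),
                     end   = ceil((index + 1) * roi_len / out_len),
    both offset by roi_start; the end coordinate is exclusive.
    Integer arithmetic avoids floating-point rounding.
    """
    start_floor = roi_start + (index * roi_len) // out_len
    end_ceil = roi_start - (-((index + 1) * roi_len) // out_len)
    return start_floor, end_ceil


# A region 5 pixels wide pooled to 2 output columns: the windows overlap by one column.
w_start_floor, w_end_ceil = window_bounds(0, 5, 2, 0)
print(w_start_floor, w_end_ceil)   # 0 3
print(window_bounds(0, 5, 2, 1))   # (2, 5)
```

Under this assumed formula, adjacent windows overlap whenever the region size is not a multiple of the output size, which is the situation the overlap caching described above addresses.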

The arithmetic circuit part in the arithmetic module 324 is configured to execute step S440 described in the above embodiment.

As shown in FIG. 9, the arithmetic circuit part includes a plurality of arithmetic circuits.

In the scenario where the calculation method of the pooling process is to find the maximum value, the arithmetic circuit is configured to have a comparison function. Take one arithmetic circuit as an example. The arithmetic circuit has one or more input terminals and an output terminal. The arithmetic circuit can compare the data input at the input terminals to obtain the maximum value, and output the maximum value to the output terminal. For example, the arithmetic circuit can be realized by a comparison circuit or by a circuit composed of comparison operators.

In the scenario where the calculation method of the pooling process is averaging, the arithmetic circuit is configured to have accumulation and averaging functions. Take one arithmetic circuit as an example. The arithmetic circuit has one or more input terminals and an output terminal. The arithmetic circuit can accumulate the data input at the input terminals to obtain the accumulated sum, and output the accumulated sum to the output terminal. The arithmetic circuit can also perform an averaging operation on the accumulated sum to obtain the average value, and output the average value to the output terminal. For example, the arithmetic circuit can be realized by an adder and a multiplier.
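The behavior of the two kinds of arithmetic circuit can be modeled in software as below. This is a behavioral sketch only; the function names are hypothetical, and averaging by multiplying the accumulated sum by a precomputed reciprocal is an assumption about how an adder plus a multiplier could be used.

```python
def max_circuit(inputs):
    """Comparator chain: keep the larger value at each comparison."""
    result = inputs[0]
    for value in inputs[1:]:
        if value > result:
            result = value
    return result


def average_circuit(inputs, reciprocal):
    """Adder accumulates the inputs; multiplier scales the sum by 1/count."""
    accumulated = 0
    for value in inputs:
        accumulated += value
    return accumulated * reciprocal


data = [2, 11, 12, 13]
print(max_circuit(data))                     # 13
print(average_circuit(data, 1 / len(data)))  # 9.5
```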

Continuing to refer to FIG. 3, the computing device 300 may include a data input interface 330. The configuration interface 310 is also configured to transmit, to the data input interface 330, configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map. The data input interface 330 is configured to read the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map, and to broadcast the read data of the input feature map to the N computing units 320.

It should be understood that, because the data input interface 330 broadcasts the read data of the input feature map to the N computing units 320, a single data reading operation from the external storage device allows all N computing units 320 to share the read data. Therefore, repeated reading of data can be avoided, and bandwidth can be saved.

The computing device 300 may be a system on a chip. It should be understood that the storage resources of the system-on-chip are generally small, and the data to be processed generally needs to be obtained from an external storage device. In this application, the computing device 300 may obtain the input data of the ROI-Pooling layer, that is, the data of the input feature map from an external storage device.

The external storage device may be Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM). DDR SDRAM can be referred to as DDR memory or DDR for short. It should be understood that this article does not limit the implementation of the external storage device.

The data input interface 330 may be configured to read data according to the data storage format of the external storage device.

Optionally, the storage format of the data of the input feature map in the external storage device is that each row of input feature data is stored in an X-bit aligned manner, and X is a multiple of 8. The data input interface 330 is configured to read input feature data from the external storage device in bursts of X bits×L, where L is a positive integer. For example, the value of X is 128.

As an example, in the external storage device, each row of data is stored in a 128-bit aligned manner. Assuming that the quantization bit width of the data is 8 bits, every 16 data are aligned and stored at one address of the external storage device, and the shortfall at the end of a row is filled with invalid data to ensure that the starting address of the next row is 128-bit aligned. In this case, the data input interface 330 is configured to read the data in the external storage device in bursts of 128 bits×L, that is, the external storage device is accessed in burst mode, and each access is made at a granularity of 128 bits. Here, L is a positive integer. For example, L is a positive integer less than or equal to 8, that is, the data input interface 330 accesses the data of at most 8 addresses in each burst.
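The alignment arithmetic in this example can be checked with a small sketch. The function name and the sample widths are hypothetical, and the burst count assumes that a burst does not span two rows, which the text does not state.

```python
import math


def row_layout(width_px, bits_per_px=8, align_bits=128, max_burst_len=8):
    """Return (addresses_per_row, padding_px, bursts_per_row) for one row.

    One address holds align_bits of data; the row tail is padded with
    invalid data so that the next row starts on an aligned address.
    """
    px_per_address = align_bits // bits_per_px          # 16 for 128-bit / 8-bit
    addresses = math.ceil(width_px / px_per_address)
    padding_px = addresses * px_per_address - width_px
    bursts = math.ceil(addresses / max_burst_len)        # assumes bursts stay in a row
    return addresses, padding_px, bursts


print(row_layout(40))   # (3, 8, 1)
print(row_layout(200))  # (13, 8, 2)
```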

It should be understood that the manner in which the computing device provided in the present application reads data from the external storage device can be adapted to the data storage format of the external storage device, thereby improving the efficiency of data reading.

In the case where the input feature map is three-dimensional, for example, the input feature map is an input cube as shown in FIG. 1, and the data input interface 330 reads the input feature map in sequence, one by one.

Optionally, the computing device 300 further includes a cache unit (not shown in FIG. 3). The data input interface 330 is configured to: read the data of the input feature map in parallel from the external storage device in row-major order; cache the data of the input feature map read in parallel in the cache unit; perform parallel-serial conversion processing on the data of the input feature map in the cache unit; and broadcast the data of the input feature map obtained by the parallel-serial conversion processing to the N computing units 320.

Here, reading the data of the input feature map in parallel from the external storage device in row-major order means that the data of the input feature map is read in parallel from the external storage device at row granularity.

For example, the data input interface 330 may be configured to read data in a zigzag scanning manner for an input feature map.

For example, the data input interface 330 may broadcast the data of the input feature map obtained by the parallel-serial conversion process to the N computing units 320 in raster order.

It should be understood that the data input interface 330 adopts a buffer unit to buffer the data of the input feature map. While realizing data buffering, it can also solve the problem of internal and external data processing speed mismatch.

Optionally, the buffer unit may be a first-in first-out (FIFO) module. The FIFO module can be a FIFO memory or a FIFO queue.

The data bit width of the FIFO module can be designed according to the data storage format of the external storage device.

For example, in the external storage device, each row of data is stored in a 128-bit aligned manner, and the data bit width of the FIFO module is 128 bits.

It should be understood that if the data storage format of the external storage device is such that each row of data is stored in a 128-bit aligned manner, and the shortfall at the end of a row is filled with invalid data to ensure that the starting address of the next row is 128-bit aligned, then the data input interface 330 can discard the filled invalid data while performing the parallel-serial conversion on the data stored in the buffer unit.
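A software model of this parallel-serial conversion might look like the following sketch. It assumes 8-bit data and 128-bit FIFO entries (16 pixels per entry), uses -1 as a hypothetical marker for the invalid padding, and the function names are illustrative.

```python
from collections import deque

PX_PER_WORD = 16  # one 128-bit FIFO entry holds 16 pixels of 8 bits each


def fill_fifo(rows, pad=-1):
    """Store each row as whole 128-bit words, padding the row tail."""
    fifo = deque()
    for row in rows:
        for i in range(0, len(row), PX_PER_WORD):
            word = row[i:i + PX_PER_WORD]
            fifo.append(word + [pad] * (PX_PER_WORD - len(word)))
    return fifo


def serialize(fifo, width):
    """Yield valid pixels one by one in raster order, dropping the padding."""
    words_per_row = -(-width // PX_PER_WORD)  # ceiling division
    word_in_row = 0
    while fifo:
        word = fifo.popleft()
        valid = min(PX_PER_WORD, width - word_in_row * PX_PER_WORD)
        yield from word[:valid]
        word_in_row = (word_in_row + 1) % words_per_row


rows = [list(range(20)), list(range(100, 120))]
stream = list(serialize(fill_fifo(rows), width=20))
print(len(stream))  # 40
```

Each 20-pixel row occupies two FIFO entries (32 slots), and the 12 padding slots per row never reach the computing units.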

The buffer unit can be provided separately from the data input interface 330, or can be integrated. For example, the buffer unit may be located in the data input interface 330.

Optionally, the number of computing units 320 included in the computing device 300 may be related to the granularity of the data read by the data input interface 330 and the number of pixels processed by the computing unit 320 in each clock cycle.

As an example, assuming that the calculation unit 320 processes 1 pixel in each clock cycle, if the granularity of the data read by the data input interface 330 is 128 bits (assuming the quantization bit width of the data is 8 bits), the number of calculation units 320 included in the computing device 300 can be set to 16, that is, the value of S can be 16.

As another example, assuming that the calculation unit 320 processes 1 pixel in each clock cycle, if the granularity of the data read by the data input interface 330 is 128 bits×8 (assuming the quantization bit width of the data is 8 bits), the number of calculation units 320 included in the computing device 300 may be set to 128, that is, the value of S may be 128.
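The two examples are consistent with dividing the read granularity by the number of bits a calculation unit consumes per clock cycle. The text only says the quantities are related, so the formula below is an inference from the examples, and the function name is hypothetical.

```python
def num_computing_units(read_granularity_bits, quant_bits=8, px_per_cycle=1):
    """Pixels delivered per read divided by pixels one unit handles per cycle."""
    return read_granularity_bits // (quant_bits * px_per_cycle)


print(num_computing_units(128))      # 16
print(num_computing_units(128 * 8))  # 128
```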

It should be noted that, in this application, the number of computing units 320 included in the computing device 300 may be determined according to actual requirements. For example, in practical applications, factors such as application performance requirements and resource occupation can be comprehensively considered to determine the number of computing units 320 included in the computing device 300. Wherein, the occupation of resources includes any one or more of the following: occupation of storage space (memory), and volume of the device.

Optionally, the computing units 320 of the computing device 300 are tailorable. In other words, the number of computing units 320 in the computing device 300 can be dynamically changed.

For example, when there are fewer regions of interest to be processed, the computing device 300 can be set to have a smaller number of computing units 320; when there are many regions of interest to be processed, the computing device 300 can be set to have a larger number of computing units 320.

It should be understood that, because its computing units 320 are tailorable, the computing device 300 can be flexibly adapted to ROI-pooling layers with different computing requirements.

It should be understood that the more computing units 320 the computing device 300 includes, the larger the storage space it requires and the larger its overall volume; the fewer computing units 320 it includes, the smaller the storage space it requires and the smaller its overall volume. Therefore, since the computing units 320 of the computing device 300 provided in the present application are tailorable, the number of computing units 320 can be flexibly adjusted based on computing requirements, which can effectively reduce resource occupation while ensuring computing performance.

It should also be understood that, based on application requirements, the number S of computing units 320 included in the computing device 300 may also be one.

Continuing to refer to FIG. 3, optionally, the computing device 300 further includes a data output interface 340 configured to output the output data calculated by the N computing units 320 to an external storage device.

Optionally, the storage format of the data of the input feature map in the external storage device is that each row of input feature data is stored in an X-bit aligned manner, and X is a multiple of 8. The data output interface 340 is configured to splice the output data of the S computing units into X bits for alignment buffering, and output the aligned buffered data to the external storage device.
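The splicing of output data into X-bit words can be sketched as follows for X = 128 and 8-bit output values. The lane order (first value in the lowest bits), the zero fill of unused lanes, and the function name are assumptions made for illustration.

```python
def splice_outputs(outputs, bits_per_value=8, align_bits=128):
    """Pack output values into align_bits-wide words, first value in the lowest lane."""
    per_word = align_bits // bits_per_value
    mask = (1 << bits_per_value) - 1
    words = []
    for i in range(0, len(outputs), per_word):
        word = 0
        for lane, value in enumerate(outputs[i:i + per_word]):
            word |= (value & mask) << (lane * bits_per_value)
        words.append(word)  # unused lanes of the last word stay zero
    return words


print([hex(w) for w in splice_outputs([1, 2])])  # ['0x201']
print(len(splice_outputs(list(range(17)))))      # 2
```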

Optionally, N is an integer greater than 1. As shown in FIG. 10, the computing device 300 may further include an arbitration unit 350 configured to sequentially transmit the output data calculated by the N computing units 320 to the data output interface 340 in a preset order.

For example, the arbitration unit 350 may be configured to transmit the aligned data to the data output interface 340 using a fair polling algorithm.

It should be understood that for the output data of the multiple computing units 320, the arbitration unit 350 is used to transmit it to the data output interface 340 in a preset order, which is beneficial to the management of the data in the subsequent process.
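One possible reading of the fair polling algorithm is a round-robin arbiter, sketched below. The queue model and the function name are hypothetical; the sketch only shows how a fixed polling order yields a predictable output sequence.

```python
from collections import deque


def round_robin_arbiter(unit_outputs):
    """Visit the units in a fixed cyclic order, forwarding one pending item per visit.

    Returns the forwarded sequence as (unit_id, value) pairs.
    """
    queues = [deque(pending) for pending in unit_outputs]
    forwarded = []
    while any(queues):  # an empty deque is falsy
        for unit_id, queue in enumerate(queues):
            if queue:
                forwarded.append((unit_id, queue.popleft()))
    return forwarded


print(round_robin_arbiter([[29, 22], [7], [5, 6, 8]]))
# [(0, 29), (1, 7), (2, 5), (0, 22), (2, 6), (2, 8)]
```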

Optionally, the configuration interface 310 is configured to, after the computing device 300 completes the pooling process of the N regions of interest, transmit configuration information indicating the positions of P regions of interest to P computing units 320 of the S computing units 320, where the P regions of interest correspond to the P computing units 320 one-to-one, and P is a positive integer less than or equal to S.

When the number of regions of interest on an input feature map is greater than N, the P regions of interest are regions of interest on the current input feature map that have not been pooled.

In the case where the input data of the ROI-pooling layer is multiple input feature maps (L as shown in FIG. 1), the P regions of interest may be regions of interest on the next input feature map.

In the case where the number of regions of interest on an input feature map is greater than N and the input data of the ROI-pooling layer is multiple input feature maps, the P regions of interest may be regions of interest on the current input feature map that have not been pooled, or may be regions of interest on the next input feature map.

In other words, after the execution of each configuration instruction is completed, whether the next instruction preferentially switches to the next input feature map or preferentially switches to the P regions of interest on the current input feature map that have not been pooled can be dynamically configured by the instruction. In practical applications, the calculation rate and bandwidth requirements of the two switching sequences can be analyzed according to actual needs, so as to select the better switching sequence.
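The two switching sequences can be illustrated with a small scheduling sketch. It assumes that the same list of regions of interest applies to every input feature map and that at most S regions are processed per configuration instruction; the function and parameter names are hypothetical.

```python
def schedule(num_maps, num_rois, num_units, roi_first=True):
    """Return the order of (feature_map_index, roi_batch) configuration steps.

    roi_first=True : finish all regions on the current map, then switch maps.
    roi_first=False: run the same batch on every map, then switch batches.
    """
    batches = [
        list(range(start, min(start + num_units, num_rois)))
        for start in range(0, num_rois, num_units)
    ]
    if roi_first:
        return [(m, batch) for m in range(num_maps) for batch in batches]
    return [(m, batch) for batch in batches for m in range(num_maps)]


print(schedule(2, 3, 2, roi_first=True))
# [(0, [0, 1]), (0, [2]), (1, [0, 1]), (1, [2])]
print(schedule(2, 3, 2, roi_first=False))
# [(0, [0, 1]), (1, [0, 1]), (0, [2]), (1, [2])]
```

In the first order each feature map is visited in one pass but its data is needed once per batch; in the second the region configuration is reused across maps. Which order costs less bandwidth depends on the actual sizes, as the text notes.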

Based on the above description, the computing device provided by the present application can support parallel pooling processing of multiple regions of interest by including multiple computing units, and therefore, can improve the processing efficiency of the ROI-pooling layer.

It should be understood that the unit of the resolution or size of any image or region mentioned in this document is pixels. For example, in FIG. 2, the resolution of the input feature map is 8×8 (unit: pixels), the resolution of the region of interest is 7×5 (unit: pixels), and the size of the pooled output box is 2×2 (unit: pixels).

The computing device 300 provided in this application may be an application specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).

The computing device 300 provided in this application can be applied to implement hardware acceleration of the ROI-pooling layer in a convolutional neural network (CNN).

As an example, the computing device 300 can be applied to an intellectual property (IP) core and to circuits that operate cooperatively between IP cores.

As another embodiment, the computing device 300 may also be applied to other types of neural network accelerators or processors that include the ROI-pooling layer.

The device embodiments provided by the present application are described above, and the method embodiments provided by the present application will be described below. It should be understood that the description of the method embodiment and the description of the device embodiment correspond to each other. Therefore, for the content that is not described in detail, please refer to the above device embodiment. For brevity, details are not repeated here.

FIG. 11 is a schematic flowchart of a method for calculating an ROI-pooling layer provided by an embodiment of the application. For example, the calculation method may be executed by the calculation device 300 of the above embodiment. The calculation method includes the following steps S1110 and S1120.

S1110. Acquire configuration information indicating the positions of N regions of interest, where N is a positive integer.

S1120: Perform parallel pooling processing on the N regions of interest according to the configuration information to obtain output data of the corresponding regions of interest.

Step S1120 includes the pooling process for the N regions of interest, wherein the method for pooling the first region of interest includes steps S410 to S440 as described above. For the sake of brevity, they are not repeated here. It should be understood that the first region of interest represents each of the N regions of interest.

Optionally, in the embodiment shown in FIG. 11, step S1120 combined with step S440 described above may specifically include: obtaining the column processing result of each row of data falling in the data window area; and performing row processing on the column processing result to obtain the output data of the data window area.

Optionally, in the embodiment shown in FIG. 11, step S1120 combined with step S440 described above may specifically include: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, in the process of obtaining the output data of the first data window area, caching the first column processing result of the row overlap area; in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.

Optionally, in the embodiment shown in FIG. 11, step S1120 combined with step S440 described above may specifically include: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, in the process of obtaining the output data of the first data window area, caching the row processing result of the first column processing result of the row overlap area; in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area, to obtain the output data of the second data window area.

Optionally, in the embodiment shown in FIG. 11, step S1120 includes: performing parallel pooling processing on the N regions of interest by N computing units in a computing device including S computing units, where the N computing units correspond one-to-one to the N regions of interest, S is an integer greater than 1, and N is an integer less than S.

It can be understood that, in addition to using N computing units to perform parallel pooling processing on N regions of interest at the hardware level, it can also be performed at the software level to improve the processing efficiency of the regions of interest, which is not specifically limited here.

The computing device is, for example, the computing device 300 in the above embodiment. The N calculation units are, for example, the N calculation units 320 in the above embodiment.

For example, the calculation unit processes one pixel in each clock cycle.

Optionally, the calculation unit includes an arithmetic module for performing arithmetic processing on the data falling into the corresponding data window.

For example, the number of arithmetic modules included in the calculation unit is related to the width of the pooled output box.

Wherein, the operation module is, for example, the operation module 324 in the above embodiment.

Optionally, in the embodiment shown in FIG. 11, the method further includes performing the following steps through the data input interface: obtaining configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map; and, according to the starting position and the resolution of the input feature map, reading the data of the input feature map from the external storage device and broadcasting the read data of the input feature map to the N computing units.

Wherein, the data input interface is, for example, the data input interface 330 in the above embodiment.

Optionally, in the embodiment shown in FIG. 11, reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map includes: reading the data of the input feature map in parallel from the external storage device in row-major order. Broadcasting the read data of the input feature map to the N computing units includes: buffering the data of the input feature map read in parallel in a cache unit; performing parallel-serial conversion processing on the data of the input feature map in the cache unit; and broadcasting the data of the input feature map obtained by the parallel-serial conversion processing to the N computing units.

Optionally, the caching unit and the computing unit are separated, and at the same time, the caching unit can be provided separately from the data input interface, or can be integrated. For example, the buffer unit may be located in the data input interface.

For example, the number S of computing units included in the computing device is related to the granularity of the data read by the data input interface and the number of pixels processed by the computing unit in each clock cycle.

Optionally, in the embodiment shown in FIG. 11, the calculation method further includes: outputting the output data of the N regions of interest to the external storage device through the data output interface.

For example, the calculation method further includes: sequentially transmitting the output data of the N regions of interest to the data output interface in a preset order through the arbitration unit.

The data output interface is, for example, the data output interface 340 in the above embodiment, and the arbitration unit is, for example, the arbitration unit 350 in the above embodiment.
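As a rough illustration of the arbitration step, the sketch below forwards the results of several units to a single output in a fixed order. The text only says the order is preset, so representing it as a list of unit indices is an assumption, and the names `arbitrate`, `unit_outputs`, and `order` are hypothetical.

```python
def arbitrate(unit_outputs, order):
    """Yield (unit_id, value) pairs, draining one unit at a time in the preset order."""
    for unit_id in order:
        for value in unit_outputs[unit_id]:
            yield unit_id, value


if __name__ == "__main__":
    outputs = {0: [5, 7], 1: [10, 11], 2: [14]}
    # Results reach the single output interface one unit after another.
    for unit_id, value in arbitrate(outputs, order=[0, 1, 2]):
        print(unit_id, value)
```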

Optionally, in the embodiment shown in FIG. 11, after the pooling processing of the N regions of interest is completed, the calculation method further includes: acquiring configuration information indicating the positions of P regions of interest, where the P regions of interest are regions of interest that have not yet been pooled on the current input feature map, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
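The round-by-round configuration described here can be pictured as splitting the regions of interest of one feature map into batches of at most S, one region per computing unit. The sketch below is only illustrative: the function name `schedule_rois` and the choice to assign regions to units in index order are assumptions.

```python
def schedule_rois(rois, s):
    """Yield successive batches of (unit_index, roi) pairs, at most s pairs per batch."""
    for first in range(0, len(rois), s):
        batch = rois[first:first + s]
        # One-to-one mapping: the i-th region in the batch goes to the i-th unit.
        yield list(enumerate(batch))


if __name__ == "__main__":
    # Five regions on a device with S = 2 units need three rounds (N = 2, then P = 2, then P = 1).
    for batch in schedule_rois(["roi0", "roi1", "roi2", "roi3", "roi4"], 2):
        print(batch)
```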

It is described above that in the embodiment shown in FIG. 11, step S1120 is implemented by N computing units. Optionally, this step S1120 can also be implemented by software.
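A software version of step S1120 might look like the following sketch. It is not the method of this application as such: it assumes max pooling as the arithmetic processing, inclusive ROI coordinates `(x0, y0, x1, y1)`, and a floor/ceil split of the ROI into data windows, and it uses a thread pool only as a stand-in for the N computing units. The names `window_bounds`, `roi_max_pool`, and `pool_rois_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def window_bounds(start, length, out_size, idx):
    """Half-open [lo, hi) extent of output bin `idx` along one axis (assumed floor/ceil split)."""
    lo = start + (idx * length) // out_size
    hi = start + ((idx + 1) * length + out_size - 1) // out_size
    return lo, hi


def roi_max_pool(fmap, roi, out_h, out_w):
    """Pool one ROI to an out_h x out_w grid: column processing first, then row processing."""
    x0, y0, x1, y1 = roi
    roi_w, roi_h = x1 - x0 + 1, y1 - y0 + 1
    col_spans = [window_bounds(x0, roi_w, out_w, j) for j in range(out_w)]
    # Column processing: reduce each ROI row once per data window column span.
    col_results = {
        y: [max(fmap[y][lo:hi]) for lo, hi in col_spans]
        for y in range(y0, y1 + 1)
    }
    out = []
    for i in range(out_h):
        lo, hi = window_bounds(y0, roi_h, out_h, i)
        # Row processing: combine the per-row column results that fall in this window.
        out.append([max(col_results[y][j] for y in range(lo, hi)) for j in range(out_w)])
    return out


def pool_rois_parallel(fmap, rois, out_h, out_w):
    """Pool several ROIs concurrently, one worker per ROI."""
    with ThreadPoolExecutor(max_workers=max(1, len(rois))) as pool:
        return list(pool.map(lambda r: roi_max_pool(fmap, r, out_h, out_w), rois))


if __name__ == "__main__":
    fmap = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(pool_rois_parallel(fmap, [(0, 0, 3, 3), (1, 1, 3, 3)], 2, 2))
    # prints [[[5, 7], [13, 15]], [[10, 11], [14, 15]]]
```

In the second ROI the two window rows overlap on feature map row 2. Because the column results are computed once per row and stored in `col_results`, both windows reuse them rather than rescanning the overlapping row.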

FIG. 12 is a schematic block diagram of a neural network system 1200 provided by an embodiment of the application. The neural network system 1200 includes a region-of-interest-pooling layer computing device 1210, and the computing device 1210 is the computing device 300 in the above embodiment.

It should be understood that the neural network system 1200 may also include other neural network layer computing devices 1220.

For example, the computing device 1220 includes any one or more of the following computing devices: a computing device for a convolutional layer, a computing device for an activation layer, a computing device for a pooling layer, and a computing device for a fully connected layer.

The computing devices mentioned herein can also be referred to as hardware accelerators.

It can be understood that, for the calculation method of the ROI-pooling layer and the beneficial effects of the neural network system provided herein, reference may be made to the description of the region of interest-pooling layer computing device in the above embodiments; the details are not repeated here.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in combination with the embodiments disclosed herein can be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the specific application and design constraint conditions of the technical solution. Professionals and technicians can use different methods for each specific application to implement the described functions, but such implementation should not be considered beyond the scope of this application.

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division of the units is only a logical function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive of within the technical scope disclosed in this application shall be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims

A computing device for a region of interest-pooling layer, wherein the computing device includes a configuration interface and S computing units, and the S is an integer greater than 1;

The configuration interface is configured to transmit configuration information indicating positions of N regions of interest to N computing units of the S computing units, wherein the N regions of interest are in one-to-one correspondence with the N computing units, and the N is a positive integer less than or equal to the S;

Each of the N computing units is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.
The computing device according to claim 1, wherein a first computing unit of the N computing units is configured to perform pooling processing on a first region of interest to obtain output data of the first region of interest;

Wherein, the performing pooling processing on the first region of interest to obtain output data of the first region of interest includes:

Acquiring data of an input feature map, the input feature map including K regions of interest, and the K is a positive integer not less than the N;

Obtaining, according to the position of the first region of interest and the resolution of the pooled output frame, a data window area corresponding to the data to be output of the first region of interest on the first region of interest;

Selecting data that falls into the data window area from the acquired data of the input feature map;

Perform arithmetic processing on the data falling in the data window area to obtain output data in the data window area.
The computing device according to claim 2, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

Acquiring a column processing result of each row of data falling in the data window area;

Perform row processing on the column processing result to obtain output data in the data window area.
The computing device according to claim 2, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

If the first region of interest includes a first data window area and a second data window area that share a row overlap area, buffering a first column processing result of the row overlap area in the process of obtaining the output data of the first data window area;

In the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area excluding the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the buffered first column processing result to obtain the output data of the second data window area.
The computing device according to claim 2, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

If the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching, in the process of obtaining the output data of the first data window area, a row processing result of the first column processing result of the row overlap area;

In the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area excluding the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.
The computing device according to any one of claims 1 to 5, wherein the computing device further comprises a data input interface;

Wherein, the configuration interface is further configured to transmit configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map to the data input interface;

The data input interface is configured as:

Reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map;

Broadcast the read data of the input feature map to the N computing units.
The computing device according to claim 6, wherein the computing device further comprises a cache unit;

Wherein, the data input interface is configured as:

Read the data of the input feature map in parallel from the external storage device in row-major order;

Buffer the data of the input feature map read in parallel in the cache unit;

Performing parallel-serial conversion processing on the data of the input feature map in the cache unit;

Broadcasting the data of the input feature map obtained by the parallel-serial conversion process to the N computing units.
The computing device according to claim 6 or 7, wherein the computing device further comprises:

A data output interface, configured to output the output data calculated by the N computing units to the external storage device.
The computing device according to claim 8, wherein the computing device further comprises:

An arbitration unit, configured to sequentially transmit the output data calculated by the N computing units to the data output interface in a preset order.
The computing device according to any one of claims 6 to 9, wherein the S is related to the granularity of the data read by the data input interface and the number of pixels processed by the computing unit in each clock cycle.
The computing device according to claim 10, wherein the computing unit processes one pixel in each clock cycle.
The computing device according to any one of claims 2 to 5, wherein the first computing unit comprises:

An arithmetic module, configured to perform the arithmetic processing on the data falling in the data window area to obtain the output data of the data window area.
The computing device according to claim 12, wherein the number of the arithmetic modules is related to the width of the pooled output frame.
The computing device according to any one of claims 2 to 5, wherein the first computing unit further comprises:

The storage module is configured to buffer the received data of the input feature map.
The computing device according to any one of claims 1 to 14, wherein the configuration interface is configured to:

After the computing device completes the pooling processing of the N regions of interest, transmit configuration information indicating the positions of P regions of interest to P computing units of the S computing units, wherein the P regions of interest are in one-to-one correspondence with the P computing units, and the P is a positive integer less than or equal to the S;

Wherein, the P regions of interest are regions of interest that have not been pooled on the current input feature map, or the P regions of interest are regions of interest on the next input feature map.
The computing device according to any one of claims 1 to 15, wherein the computing device is an application specific integrated circuit ASIC or a field programmable gate array FPGA.
A calculation method for a region of interest-pooling layer, comprising:

Acquiring configuration information indicating positions of N regions of interest on the input feature map, where N is a positive integer;

According to the configuration information, parallel pooling processing is performed on the N regions of interest to obtain output data of the corresponding regions of interest.
The calculation method according to claim 17, wherein performing parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest comprises:

Performing pooling processing on the first region of interest to obtain output data of the first region of interest;

Wherein, the performing pooling processing on the first region of interest to obtain output data of the first region of interest includes:

Acquiring data of an input feature map, the input feature map including K regions of interest, and the K is a positive integer not less than the N;

Obtaining, according to the position of the first region of interest and the resolution of the pooled output frame, a data window area corresponding to the data to be output of the first region of interest on the first region of interest;

Selecting data that falls into the data window area from the acquired data of the input feature map;

Perform arithmetic processing on the data falling in the data window area to obtain output data in the data window area.
The calculation method according to claim 18, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

Acquiring a column processing result of each row of data falling in the data window area;

Perform row processing on the column processing result to obtain output data in the data window area.
The calculation method according to claim 18, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

If the first region of interest includes a first data window area and a second data window area that share a row overlap area, buffering a first column processing result of the row overlap area in the process of obtaining the output data of the first data window area;

In the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area excluding the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the buffered first column processing result to obtain the output data of the second data window area.
The calculation method according to claim 18, wherein the performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area comprises:

If the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching, in the process of obtaining the output data of the first data window area, a row processing result of the first column processing result of the row overlap area;

In the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area excluding the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.
The calculation method according to any one of claims 17 to 21, wherein the performing parallel pooling processing on the N regions of interest includes:

Parallel pooling processing is performed on the N regions of interest by N computing units in a computing device including S computing units, where the N computing units are in one-to-one correspondence with the N regions of interest, the S is an integer greater than 1, and the N is less than or equal to the S.
The calculation method according to claim 22, wherein the calculation method further comprises:

Through the data input interface:

Acquiring configuration information indicating the starting position of the input feature map in the external storage device and indicating the resolution of the input feature map;

Reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map;

Broadcast the read data of the input feature map to the N computing units.
The calculation method according to claim 23, wherein the reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map includes:

Reading the data of the input feature map in parallel from the external storage device in row-major order according to the starting position and the resolution of the input feature map;

The broadcasting the read data of the input feature map to the N computing units includes:

Buffering the data of the input feature map read in parallel in a cache unit;

Performing parallel-serial conversion processing on the data of the input feature map in the cache unit;

Broadcasting the data of the input feature map obtained by the parallel-serial conversion process to the N computing units.
The calculation method according to any one of claims 22 to 24, wherein the S is related to the granularity of the data read by the data input interface and the number of pixels processed by the computing unit in each clock cycle.
The calculation method according to claim 25, wherein the computing unit processes one pixel in each clock cycle.
The calculation method according to any one of claims 22 to 26, wherein the computing unit comprises an arithmetic module for performing arithmetic processing on the data falling in the corresponding data window area in the region of interest.
The calculation method according to claim 27, wherein the number of the arithmetic modules included in the computing unit is related to the width of the pooled output frame.
The calculation method according to any one of claims 22 to 28, wherein the computing unit further comprises a cache module, configured to cache the acquired data of the input feature map.
The calculation method according to any one of claims 17 to 29, wherein the calculation method further comprises:

The output data of the N regions of interest are output to an external storage device through the data output interface.
The calculation method according to claim 30, wherein the calculation method further comprises:

Through the arbitration unit, the output data of the N regions of interest are sequentially transmitted to the data output interface in a preset order.
The calculation method according to any one of claims 17 to 31, characterized in that, after the pooling processing of the N regions of interest is completed, the calculation method further comprises:

Acquiring configuration information indicating the positions of P regions of interest, where the P regions of interest are regions of interest that have not yet been pooled on the current input feature map, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
The calculation method according to any one of claims 22 to 29, wherein the computing device is an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
A neural network system, comprising:

The region of interest-pooling layer computing device according to any one of claims 1 to 16.