WO2021092941A1 - Roi-pooling layer computation method and device, and neural network system - Google Patents


Info

Publication number
WO2021092941A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
interest
window area
feature map
region
Prior art date
Application number
PCT/CN2019/118933
Other languages
French (fr)
Chinese (zh)
Inventor
谷骞
高明明
杨康
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to CN201980039309.2A priority Critical patent/CN112313673A/en
Priority to PCT/CN2019/118933 priority patent/WO2021092941A1/en
Publication of WO2021092941A1 publication Critical patent/WO2021092941A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • This application relates to the field of data processing, and more specifically, to a calculation method and device for a region of interest-pooling layer, and a neural network system.
  • CNN: convolutional neural network
  • A CNN is composed of several predefined basic layers, including convolutional layers, activation layers, pooling layers, and fully connected layers, where the pooling layers can include the region of interest (ROI) pooling layer (ROI-pooling layer).
  • ROI: region of interest
  • the data processing of the region of interest-pooling layer is implemented by a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform.
  • the region of interest-pooling layer is computationally intensive.
  • the computing throughput of the CPU computing platform is not high, and it cannot meet the computing performance requirements of the region of interest-pooling layer.
  • the power consumption of the GPU computing platform is too high. It can be seen that traditional CPU or GPU computing solutions cannot achieve a balance between computing performance and power consumption.
  • the present application provides a calculation method and device for a region of interest-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the region of interest-pooling layer without causing large power consumption.
  • the first aspect provides a computing device for a region of interest-pooling layer.
  • the computing device includes a configuration interface and S computing units, where S is an integer greater than one.
  • the configuration interface is configured to transmit configuration information indicating the positions of N regions of interest to N computing units of the S computing units, where the N regions of interest correspond to the N computing units one-to-one, and N is a positive integer less than or equal to S.
  • Each of the N computing units is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.
  • the second aspect provides a calculation method for the region of interest-pooling layer.
  • the calculation method includes: obtaining configuration information indicating the positions of N regions of interest, where N is a positive integer; according to the configuration information, performing parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest.
  • a third aspect provides a neural network system, which includes the region of interest-pooling layer computing device of the first aspect.
  • the computing device provided by the present application includes multiple computing units, which can support parallel pooling processing of multiple regions of interest, and therefore, can improve the processing efficiency of the region of interest-pooling layer.
  • Figure 1 is a schematic diagram of the function of the region of interest-pooling layer.
  • Figure 2 is a schematic diagram of the region of interest-pooling.
  • Fig. 3 is a schematic block diagram of a computing device according to an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of obtaining output data of a data window area in an embodiment of the application.
  • Figure 5 is another schematic diagram of the region of interest-pooling.
  • Figure 6 is another schematic diagram of the region of interest-pooling.
  • Figure 7 is another schematic diagram of the region of interest-pooling.
  • FIG. 8 is a schematic block diagram of a computing unit according to an embodiment of the application.
  • FIG. 9 is another schematic block diagram of a computing unit according to an embodiment of the application.
  • Fig. 10 is another schematic block diagram of a computing device according to an embodiment of the present application.
  • Fig. 11 is a schematic flowchart of a calculation method for a region of interest-pooling layer according to an embodiment of the present application.
  • Fig. 12 is a schematic block diagram of a neural network system according to an embodiment of the present application.
  • the following first introduces the related concepts of the region of interest-pooling layer (hereinafter referred to as the ROI-pooling layer).
  • the function of the ROI-pooling layer is to down-sample the region of interest (ROI) in the feature map.
  • the input data (input feature map, IFM) of the ROI-pooling layer is the output of the previous layer.
  • the input data of the ROI-pooling layer can be an array composed of a single feature map, or a 3D array composed of multiple feature maps. As shown in Figure 1, the input data of the ROI-pooling layer is L feature maps, and the resolution of each feature map is H (height) × W (width).
  • the output data (output feature map, OFM) of the ROI-pooling layer is composed of several cubes. As shown on the right side of Figure 1, there are a total of M cubes. The number of cubes output by the ROI-pooling layer is determined by the number of regions of interest (ROI) in the input feature map.
  • each cube is composed of L output feature maps.
  • the resolution of each output feature map is the same.
  • the resolution of each output feature map in the cube is E (height) × F (width).
  • the function of the ROI-pooling layer is to downsample the region of interest in the input feature map. For example, in Figure 1, taking one of the L feature maps as an example, the ROI-pooling layer downsamples the input feature map with a resolution of H × W into an output feature map with a resolution of E × F.
  • the resolution of the feature map output by the ROI-pooling layer may be predefined.
  • the resolution E × F of the output cube may be predefined.
  • the mapping method of the ROI-pooling layer pooling processing can also be defined in advance, and generally there are two types: maximum (max) or average (avg).
  • the feature of the ROI-pooling layer is that the size of the region of interest to be pooled may not be fixed, and the size of the output feature map corresponding to each region of interest is fixed.
  • the calculation process of the ROI-pooling layer is: according to the resolution E × F of the output cube and the position of the region of interest in the input feature map, infer point by point the data window area on the input feature map that corresponds to each output data; then perform arithmetic processing on the data in the data window area to obtain the output data corresponding to that data window area.
  • the arithmetic processing here can be the maximum value or the average value.
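  • The calculation process above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation described by the application: the text does not state how window boundaries are derived, so the floor/ceil split used in `window_bounds` is an assumption, and the function names, argument names, and the 4 × 4 feature map are hypothetical.

```python
import math

def window_bounds(roi_start, roi_len, out_len, index):
    """Half-open [lo, hi) span of window `index` along one axis.

    Assumption: a floor/ceil split of the ROI; the source text does not
    specify the rule, so other splits are equally possible.
    """
    lo = roi_start + math.floor(index * roi_len / out_len)
    hi = roi_start + math.ceil((index + 1) * roi_len / out_len)
    return lo, hi

def roi_pool(feature_map, top, left, height, width, out_h, out_w, mode="max"):
    """Pool one region of interest of a 2D feature map into out_h x out_w outputs."""
    output = []
    for i in range(out_h):
        r0, r1 = window_bounds(top, height, out_h, i)
        out_row = []
        for j in range(out_w):
            c0, c1 = window_bounds(left, width, out_w, j)
            # Data window area for output (i, j).
            window = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            if mode == "max":
                out_row.append(max(window))
            else:  # "avg"
                out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output

# Hypothetical 4 x 4 input feature map; the whole map is the region of interest.
feature_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled_max = roi_pool(feature_map, 0, 0, 4, 4, 2, 2, mode="max")
pooled_avg = roi_pool(feature_map, 0, 0, 4, 4, 2, 2, mode="avg")
```

  • With this split, windows overlap whenever the region size is not a multiple of the output size, which is the situation the later overlap-caching embodiments address.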
  • the region of interest represents the region to be pooled (i.e., down-sampled) on the input feature map.
  • the resolution of the pooled output frame indicates the resolution of the feature map obtained after the region of interest is pooled.
  • the resolution of the pooled output frame is the resolution E × F of the output cube shown in FIG. 1.
  • the pooled output box can also be referred to as a feature map obtained after the region of interest is pooled.
  • the data window area represents the area, within a region of interest, that corresponds to one output data of that region of interest.
  • the output data is obtained by pooling data in a certain sub-region in the region of interest.
  • a certain sub-area here can be recorded as the data window area.
  • for example, the resolution of the pooled output frame is 2 × 2, which means that the resolution of the feature map obtained after any region of interest in the input feature map is pooled is 2 × 2, or that any region of interest in the input feature map corresponds to 4 output data.
  • FIG. 2 shows a region of interest surrounded by rows 4 to 8 and columns 1 to 7 in the input feature map.
  • the four output data corresponding to the region of interest are C1, C2, C3, and C4, where the data window area corresponding to C1 on the region of interest is A1, the data window area corresponding to C2 is A2, the data window area corresponding to C3 is A3, and the data window area corresponding to C4 is A4.
  • C1 is obtained by performing arithmetic processing on the data in the data window area A1 (maximum value or average value)
  • C2, C3, and C4 are obtained by analogy.
  • FIG. 2 is only an example and not a limitation. In practical applications, one input feature map may include multiple regions of interest.
  • a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform is used to process the calculation of the ROI-pooling layer.
  • the computing efficiency of the CPU computing platform cannot meet the computing performance requirements of the ROI-pooling layer, and the GPU computing platform consumes a lot of power.
  • This application proposes a calculation method and device for the ROI-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the ROI-pooling layer without causing large power consumption.
  • FIG. 3 is a schematic block diagram of a computing device 300 of the ROI-pooling layer provided by this application.
  • the computing device 300 includes a plurality of computing units 320, and each computing unit 320 has a function of performing pooling processing on a region of interest. It should be understood that through the multiple computing units 320, parallel pooling processing of multiple regions of interest can be implemented. In other words, the computing device provided by the present application can implement parallel data processing of the ROI-pooling layer, thereby improving the processing efficiency of the ROI-pooling layer.
  • the computing device 300 further includes a configuration interface 310 configured to transmit configuration information indicating the positions of the N regions of interest to the N computing units 320, where the N regions of interest correspond to the N computing units 320 one-to-one.
  • the configuration information may indicate the positions of the N regions of interest on the input feature map.
  • the configuration information may indicate the coordinates of the pixel points in the N regions of interest.
  • the position of the region of interest includes the coordinates of the first pixel at the upper left corner of the region of interest and the size of the region of interest.
  • the position of the region of interest includes the coordinates of all pixels in the region of interest.
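  • The two ways of describing a region of interest's position (upper-left pixel plus size, or the coordinates of all pixels) carry the same information. The sketch below illustrates this under stated assumptions: the class `RoiConfig` and its field names are hypothetical and are not part of the application, and the example uses the rows 4 to 8, columns 1 to 7 region mentioned for FIG. 2 with the same 1-based indexing as the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoiConfig:
    """Hypothetical configuration record: upper-left pixel plus size."""
    top: int      # row of the first (upper-left) pixel
    left: int     # column of the first (upper-left) pixel
    height: int   # number of rows in the region
    width: int    # number of columns in the region

    def pixel_coordinates(self):
        """Expand the compact form into the coordinates of all pixels."""
        return [
            (row, col)
            for row in range(self.top, self.top + self.height)
            for col in range(self.left, self.left + self.width)
        ]

# Region of interest spanning rows 4 to 8 and columns 1 to 7.
roi = RoiConfig(top=4, left=1, height=5, width=7)
coords = roi.pixel_coordinates()
```

  • The compact form needs four integers per region, while the expanded form grows with the region's area, which is one reason a configuration interface might prefer the former.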
  • the configuration interface 310 may be an advanced peripheral bus (APB) interface.
  • APB: advanced peripheral bus
  • Each of the N computing units 320 is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.
  • each of the N computing units 320 is configured to determine the corresponding region of interest according to the received configuration information, and to perform pooling processing on the data in the corresponding region of interest to obtain the output data of the region of interest.
  • N computing units 320 represent all computing units of the computing device 300, that is, all computing units 320 of the computing device 300 participate in the pooling process.
  • the N computing units 320 represent part of the computing units of the computing device 300, that is, part of the computing units 320 of the computing device 300 participate in the pooling process.
  • N is a positive integer less than or equal to S.
  • S is an integer greater than 1.
  • N is an integer greater than 1.
  • in this case, the pooling processing of more than one region of interest can be realized by more than one calculation unit 320, that is, parallel data processing of the ROI-pooling layer can be realized, so that the processing efficiency of the ROI-pooling layer can be improved.
  • N can also be equal to 1. For example, if only one region of interest needs to be pooled in the current computing task, the value of N is set to 1.
  • the computing device provided in the present application can support parallel pooling processing of multiple regions of interest, and therefore, can improve the processing efficiency of the ROI-pooling layer.
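  • The one-to-one assignment of N regions of interest to N of the S computing units can be mimicked in software. This is only an analogy under stated assumptions: worker threads stand in for the hardware computing units 320, each "unit" reduces its region to a single maximum (a 1 × 1 output) to keep the sketch short, and the names `pool_max` and `run_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def pool_max(feature_map, roi):
    """Stand-in for one computing unit: max over one region (top, left, height, width)."""
    top, left, height, width = roi
    return max(
        feature_map[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )

def run_parallel(feature_map, rois, num_units):
    """Give each of the N regions its own unit; requires 1 <= N <= S."""
    if not 1 <= len(rois) <= num_units:
        raise ValueError("N must satisfy 1 <= N <= S")
    with ThreadPoolExecutor(max_workers=num_units) as units:
        futures = [units.submit(pool_max, feature_map, roi) for roi in rois]
        return [future.result() for future in futures]

# Hypothetical 4 x 4 feature map, N = 2 regions, S = 4 units.
feature_map = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
results = run_parallel(feature_map, [(0, 0, 2, 2), (2, 2, 2, 2)], num_units=4)
```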
  • the N calculation units 320 operate in the same way, that is, each calculation unit 320 pools its region of interest using the same method.
  • the function and operation of the calculation unit 320 are described by taking the first calculation unit 320 of the N calculation units 320 as an example. It should be understood that the description of the first calculation unit 320 herein may be adaptively applied to each calculation unit 320 of the N calculation units 320.
  • the first calculation unit 320 is configured to perform pooling processing on the first region of interest to obtain output data of the first region of interest.
  • the first region of interest represents a region of interest corresponding to the first calculation unit 320 among the N regions of interest.
  • the first calculation unit 320 performs pooling processing on the first region of interest, and the method for obtaining the output data of the first region of interest includes the following steps S410 to S440, as shown in FIG. 4.
  • S410 Obtain data of an input feature map, where the input feature map includes K regions of interest, and K is a positive integer not less than N.
  • the computing device 300 further includes a data input interface 330 configured to read the data of the input feature map from an external storage device.
  • the first calculation unit 320 may obtain data of the input feature map from the data input interface 330.
  • the data input interface 330 is configured to send data of the input feature map to the first calculation unit 320.
  • S430 Select data that falls into the data window area from the acquired data of the input feature map.
  • S440 Perform arithmetic processing on the data falling in the data window area to obtain output data in the data window area.
  • the arithmetic processing can be the maximum value or the average value, and the specific processing method can be predefined.
  • once the first calculation unit 320 has obtained the output data of all the data window areas of the first region of interest, it has obtained the output data of the first region of interest. Assuming that the resolution of the pooled output frame is E × F, the first region of interest corresponds to E × F pieces of output data.
  • the computing device provided in the present application can implement parallel data processing of the ROI-pooling layer, and compared with the prior art, can effectively improve the processing efficiency of the ROI-pooling layer under the premise of lower power consumption.
  • Step S440 can be implemented in multiple ways.
  • in step S440, performing arithmetic processing on the data falling in the data window area to obtain output data in the data window area includes: obtaining a column processing result of each row of data falling in the data window area; and performing row processing on the column processing results to obtain the output data in the data window area.
  • if the operation mode (also referred to as the mapping relationship) of the pooling process is to find the maximum value, the operation mode corresponding to the column processing is the maximum value, and the operation mode corresponding to the row processing is also the maximum value.
  • if the operation mode (also referred to as the mapping relationship) of the pooling process is averaging, the operation mode corresponding to the column processing is the cumulative sum, and the operation mode corresponding to the row processing is the cumulative sum followed by averaging.
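  • The two-stage scheme (column processing of each row, then row processing of the per-row results) can be sketched as follows. This is a minimal illustration, not the device's implementation; the function name `pool_window` is hypothetical, and the sample rows are the values the text gives for data window area 1 in FIG. 6.

```python
def pool_window(rows, mode="max"):
    """Pool one data window area given as a list of rows.

    Column processing reduces each row to one value; row processing then
    combines the per-row values into the single output for the window.
    """
    if mode == "max":
        column_results = [max(row) for row in rows]   # column processing: maximum
        return max(column_results)                    # row processing: maximum
    column_results = [sum(row) for row in rows]       # column processing: cumulative sum
    count = sum(len(row) for row in rows)
    return sum(column_results) / count                # row processing: sum, then average

window = [[12, 3], [29, 26], [2, 11], [12, 13]]
window_max = pool_window(window, mode="max")
window_avg = pool_window(window, mode="avg")
```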
  • for example, the operation mode of the pooling process (also referred to as the mapping relationship) is to find the maximum value, the resolution of the pooled output frame is 2 × 2, and the resolution of the region of interest is 8 × 4.
  • the pooling processing of the region of interest by the first calculation unit 320 includes the following steps 1-1 and 1-2.
  • Step 1-1: based on the 2 × 2 resolution of the pooled output frame, determine 4 data window areas in the region of interest, namely data window area 1, data window area 2, data window area 3, and data window area 4, as shown in Figure 5.
  • step 1-1 corresponds to step S420 and step S430 in FIG. 4.
  • Step 1-2: perform maximum value processing on the data in the 4 data window areas respectively to obtain the corresponding output data.
  • the data in data window area 1 is subjected to maximum value processing to obtain output data 29; the data in data window area 2 is subjected to maximum value processing to obtain output data 31; the data in data window area 3 is subjected to maximum value processing to obtain output data 30; and the data in data window area 4 is subjected to maximum value processing to obtain output data 28.
  • in step 1-2, performing maximum value processing on the data in data window area 1 to obtain the output data 29 of data window area 1 may include the following sub-steps.
  • Sub-step 1-2-1: obtain the column processing result of each row of data that falls into data window area 1.
  • Sub-step 1-2-2: perform row processing on the column processing results of the rows of data in data window area 1 to obtain the output data 29 of data window area 1.
  • the maximum value of the column processing results {12, 29, 11, 13} of the 4 rows of data in data window area 1 is obtained, and the row processing result is 29, that is, the output data 29 of data window area 1 is obtained.
  • step 1-2 corresponds to step S440 in FIG. 4.
  • as another example, the operation mode of the pooling process (also referred to as the mapping relationship) is to find the maximum value, the resolution of the pooled output frame is 2 × 2, and the resolution of the region of interest is 6 × 4. It can be seen from Figure 6 that there is an overlap area between data window area 1 and data window area 3, which correspond to output data 29 and 22, respectively, and there is an overlap area between data window area 2 and data window area 4, which correspond to output data 31 and 23, respectively.
  • FIG. 6 is only an example and not a limitation.
  • the overlapping area between the two data window areas may include one or more rows of data, or may include one or more columns of data.
  • if the overlap area between the two data window areas includes one or more rows of data, the overlap area may also be referred to as a row overlap area.
  • if the overlap area between the two data window areas includes one or more columns of data, the overlap area may also be referred to as a column overlap area.
  • in step S440, performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area includes: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, caching the first column processing result of the row overlap area in the process of obtaining the output data of the first data window area; and, in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.
  • for example, the first region of interest is the region of interest shown in FIG. 6, the first data window area is data window area 1 shown in FIG. 6, and the second data window area is data window area 3 shown in FIG. 6.
  • data window area 1 and data window area 3 have a row overlap area, and the row overlap area includes two rows of data {2, 11} and {12, 13}.
  • the process of obtaining the output data 29 falling in the data window area 1 by the first calculation unit 320 may include the following steps 2-1 to 2-3.
  • the process of obtaining the output data 22 falling into the data window area 3 by the first calculation unit 320 may include the following steps 3-1 and 3-2.
  • Step 2-1 Perform column processing on each row of data that falls into the data window area 1, and obtain the column processing result of each row of data.
  • the calculation method of pooling processing is to find the maximum value. Accordingly, the calculation methods of column processing and row processing are both to find the maximum value.
  • the maximum value of the first row of data {12, 3} is obtained, giving the column processing result 12 of the first row of data; the maximum value of the second row of data {29, 26} is obtained, giving the column processing result 29 of the second row of data; the maximum value of the third row of data {2, 11} is obtained, giving the column processing result 11 of the third row of data; and the maximum value of the fourth row of data {12, 13} is obtained, giving the column processing result 13 of the fourth row of data.
  • Step 2-2: perform row processing on the column processing results obtained in step 2-1 to obtain the row processing result, that is, the output data of data window area 1.
  • the maximum value of the column processing results {12, 29, 11, 13} obtained in step 2-1 is obtained, and the row processing result is 29, that is, the output data 29 of data window area 1 is obtained.
  • Step 2-3: cache the column processing results {11, 13} of the row overlap area (that is, rows 3 and 4).
  • step 2-3 may be included in step 2-1 or step 2-2.
  • Step 3-1 Perform column processing on the row data falling in the data window area 3 except for the row overlap area.
  • Step 3-2: perform row processing on the column processing results {14, 22} obtained in step 3-1 and the column processing results {11, 13} of the row overlap area cached in step 2-3 to obtain the row processing result, that is, the output data of data window area 3.
  • the maximum value of the column processing results {14, 22} obtained in step 3-1 and the column processing results {11, 13} of the row overlap area cached in step 2-3 is obtained, giving the output data 22 of data window area 3.
  • in the process of obtaining the output data of data window area 3, the first calculation unit 320 omits the read operation for the first row of data {2, 11} and the second row of data {12, 13} of data window area 3, and directly uses the column processing results of these two rows of data cached in the process of obtaining the output data of data window area 1. From step 2-1 above, it can be seen that the two rows of data {2, 11} and {12, 13} in the row overlap area were already read in the process of obtaining the output data of data window area 1. Therefore, in this embodiment, when different data window areas overlap, repeated reading of data can be avoided.
  • this embodiment can avoid repeated reading of data, thereby saving bandwidth, and in addition, can also improve calculation efficiency.
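  • Steps 2-1 to 2-3 and 3-1 to 3-2 can be traced with a short sketch. The rows of data window area 1 and the column processing results come from the text; the raw values of the two non-overlapping rows of data window area 3 are not given there, so the values below are hypothetical and were chosen only so that their column processing results are 14 and 22. The function name `column_process` is likewise hypothetical.

```python
def column_process(rows):
    """Column processing for max pooling: one maximum per row of data."""
    return [max(row) for row in rows]

# Data window area 1 (rows 1 to 4 of the region of interest in FIG. 6).
window1_rows = [[12, 3], [29, 26], [2, 11], [12, 13]]

# Steps 2-1 and 2-2: column processing, then row processing.
window1_columns = column_process(window1_rows)
output1 = max(window1_columns)

# Step 2-3: cache the column results of the row overlap area (rows 3 and 4).
cached_overlap = window1_columns[2:]

# Step 3-1: only the rows of data window area 3 outside the overlap are read.
# Hypothetical raw values; the text gives only their column results {14, 22}.
window3_new_rows = [[14, 5], [22, 9]]
window3_new_columns = column_process(window3_new_rows)

# Step 3-2: row processing over the cached and the newly computed column results.
output3 = max(cached_overlap + window3_new_columns)
```

  • In this trace, data window area 3 triggers reads of two rows instead of four, which is the bandwidth saving the embodiment describes.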
  • step 2-1 to step 2-3, and step 3-1 to step 3-2 may be optional implementations of step S440.
  • in step S440, performing arithmetic processing on the data falling in the data window area to obtain the output data of the data window area includes: if the first region of interest includes a first data window area and a second data window area that have a row overlap area, caching the row processing result of the first column processing result of the row overlap area in the process of obtaining the output data of the first data window area; and, in the process of calculating the output data of the second data window area, performing column processing on the row data in the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.
  • for example, the first region of interest is the region of interest shown in FIG. 6, the first data window area is data window area 1 shown in FIG. 6, and the second data window area is data window area 3 shown in FIG. 6.
  • data window area 1 and data window area 3 have a row overlap area, and the row overlap area includes two rows of data {2, 11} and {12, 13}.
  • the process in which the first calculation unit 320 obtains the output data 29 falling in the data window area 1 may include steps 4-1 to 4-3.
  • the process of obtaining the output data 22 falling into the data window area 3 by the first calculation unit 320 may include steps 5-1 and 5-2.
  • step 4-1 is the same as step 2-1 described above, and step 4-2 is the same as step 2-2 described above.
  • Step 4-3: cache the row processing result {13} of the column processing results {11, 13} of the row overlap area (i.e., rows 3 and 4).
  • step 4-3 may be included in step 4-2.
  • Step 5-1 is the same as step 3-1 described above.
  • Step 5-2: perform row processing on the column processing results {14, 22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3 to obtain the row processing result, that is, the output data of data window area 3.
  • the maximum value of the column processing results {14, 22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3 is obtained, giving the output data 22 of data window area 3.
  • in the process of obtaining the output data of data window area 3, the first calculation unit 320 does not need to read the first row of data {2, 11} and the second row of data {12, 13} of data window area 3, but directly uses the row processing result of these two rows of data cached in the process of obtaining the output data of data window area 1.
  • from step 4-1 above, it can be seen that the two rows of data {2, 11} and {12, 13} in the row overlap area were already read in the process of obtaining the output data of data window area 1. Therefore, in this embodiment, when different data window areas overlap, repeated reading of data can be avoided.
  • in step 4-3 of this embodiment, only the row processing result {13} of the row overlap area (i.e., rows 3 and 4) needs to be cached. Relatively speaking, the cache requirement of the first calculation unit 320 can be further reduced.
  • step 4-1 to step 4-3, and step 5-1 to step 5-2 may be optional implementations of step S440.
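  • The variant in steps 4-3 and 5-2 caches a single value for the whole row overlap area instead of one value per overlapping row. A minimal sketch follows, assuming max pooling; the function name `window_max_with_partial` is hypothetical, and the raw values of the non-overlapping rows of data window area 3 are hypothetical values chosen to give the column processing results {14, 22} stated in the text.

```python
def window_max_with_partial(new_rows, cached_partial):
    """Max of a window from its unread rows plus one cached partial maximum."""
    column_results = [max(row) for row in new_rows]   # column processing of new rows
    return max(column_results + [cached_partial])     # row processing with cached value

# Step 4-3: reduce the overlap rows {2, 11} and {12, 13} to a single cached value.
overlap_rows = [[2, 11], [12, 13]]
cached_row_result = max(max(row) for row in overlap_rows)

# Steps 5-1 and 5-2 for data window area 3 (hypothetical raw values for the new rows).
output3 = window_max_with_partial([[14, 5], [22, 9]], cached_row_result)
```

  • Here the cache holds one value rather than two, which matches the reduced cache requirement noted for this embodiment.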
  • the left side is a column of data on the region of interest
  • the right side is a column of output data in the pooled output box (also referred to as an output feature map) corresponding to the region of interest.
  • the pixel point a is calculated from the input data {1, 2, 3, 4, 5}, and the pixel point b is calculated from the input data {3, 4, 5, 6, 7}.
  • the calculation result of {3, 4, 5} is both an intermediate result of pixel point a and an intermediate result of pixel point b.
  • when calculating pixel point a, the calculation result of {3, 4, 5} is cached; when calculating pixel point b, the cached calculation result of {3, 4, 5} can be read directly, and pixel point b is calculated from the calculation result of {3, 4, 5} and the data {6, 7}.
  • this embodiment can save bandwidth and improve calculation efficiency by avoiding repeated reading of data, and in addition, can also reduce cache requirements.
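  • The reuse of the shared intermediate result can be shown for both operation modes. This sketch assumes that the numbers 1 to 7 listed in the text are the data values themselves (they may equally be position labels in the figure), and the variable names are hypothetical.

```python
column = [1, 2, 3, 4, 5, 6, 7]   # one column of data on the region of interest
shared = column[2:5]             # {3, 4, 5}, used by both pixel a and pixel b

# Maximum mode: cache the maximum of the shared data once.
shared_max = max(shared)
pixel_a_max = max(max(column[0:2]), shared_max)   # from {1, 2} and cached {3, 4, 5}
pixel_b_max = max(shared_max, max(column[5:7]))   # from cached {3, 4, 5} and {6, 7}

# Average mode: cache the cumulative sum of the shared data once.
shared_sum = sum(shared)
pixel_a_avg = (sum(column[0:2]) + shared_sum) / 5
pixel_b_avg = (shared_sum + sum(column[5:7])) / 5
```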
  • in the above embodiments of step S440, the data falling in the data window area is processed by column processing first and then row processing to obtain the output data of the data window area.
  • This application is not limited to this.
  • in step S440, it is also possible to perform arithmetic processing on the data falling in the data window area by row processing first and then column processing to obtain the output data of the data window area.
  • step S440 includes: obtaining a row processing result of each column of data falling in the data window area; performing column processing on the obtained row processing result to obtain output data in the data window area.
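  • For maximum and for cumulative sum, the two orders give the same result, which is why either order can implement step S440. The sketch below checks this on the data of data window area 1 from FIG. 6; the function names are hypothetical, and the terms follow the text's usage, where column processing reduces each row of data and row processing reduces each column of data.

```python
def column_then_row(rows, reduce_fn):
    """Column processing of each row first, then row processing of the results."""
    return reduce_fn([reduce_fn(row) for row in rows])

def row_then_column(rows, reduce_fn):
    """Row processing of each column first, then column processing of the results."""
    columns = list(zip(*rows))   # transpose: one tuple per column of data
    return reduce_fn([reduce_fn(column) for column in columns])

window = [[12, 3], [29, 26], [2, 11], [12, 13]]
max_a = column_then_row(window, max)
max_b = row_then_column(window, max)
sum_a = column_then_row(window, sum)
sum_b = row_then_column(window, sum)
```

  • For averaging, either order would divide the final cumulative sum by the number of data in the window.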
  • the first calculation unit 320 may include one or more arithmetic modules 324, and the arithmetic modules 324 are configured to perform arithmetic processing on the data falling in the data window area to obtain the output data of the data window area.
  • step S440 in the above embodiment is executed by the arithmetic module 324 in the first calculation unit 320.
  • the first calculation unit 320 includes a plurality of arithmetic modules 324, wherein each arithmetic module 324 is configured to obtain one output data.
  • the number of arithmetic modules 324 included in the first calculation unit 320 may be related to the width of the pooled output box.
  • for example, the resolution of the pooled output frame is E (height) × F (width), and the number of arithmetic modules 324 in the first calculation unit 320 may be F.
  • the number of arithmetic modules in the first calculation unit 320 may also be determined according to actual needs, which is not limited in this application. For example, in practical applications, factors such as application performance requirements and resource occupation can be comprehensively considered to determine the number of arithmetic modules in the first calculation unit 320.
  • the occupation of resources includes any one or more of the following: occupation of storage space (memory), and volume of the device.
  • the arithmetic modules of the first calculation unit 320 can be tailored. In other words, the number of arithmetic modules of the first calculation unit 320 may be dynamically changed.
  • because the arithmetic modules of the first calculation unit 320 provided in the present application can be tailored, the number of arithmetic modules in the first calculation unit 320 can be flexibly adjusted based on calculation requirements, thereby improving calculation efficiency and the utilization rate of computing resources.
  • the first calculation unit 320 may also include a calculation control module configured to perform step S420 and step S430 in the above embodiment.
  • the calculation control module and the arithmetic module 324 can be set separately or collectively.
  • the arithmetic module 324 includes the calculation control module.
  • the first calculation unit 320 may further include a sub-configuration interface 321, a sub-data interface 322, and a storage module 323.
  • the sub-configuration interface 321 is configured to receive configuration information from the configuration interface 310.
  • both the configuration interface 310 and the sub-configuration interface 321 may be an advanced peripheral bus (APB) interface.
  • the sub-data interface 322 is configured to receive data of the input feature map from the data input interface 330.
  • the storage module 323 is configured to cache intermediate processing results.
  • the storage module 323 may be configured to buffer intermediate processing data that needs to be reused in the process of obtaining the output data of the first region of interest by the first calculation unit 320.
  • the storage module 323 may be configured to cache the column processing result or row processing result of the data in the row-overlap area of data window area 1 and data window area 3.
  • the storage module 323 may be located in the calculation module 324.
  • the storage module 323 can also be configured to store the data of the input feature map received by the sub-data interface 322.
  • the data of the input feature map stored in the storage module 323 can support the first calculation unit 320 to perform pooling processing of multiple regions of interest.
  • the first calculation unit 320 can directly obtain the data of the input feature map from the storage module 323 without obtaining it from the outside.
  • FIG. 9 is another schematic block diagram of the first calculation unit 320.
  • the first calculation unit 320 includes a sub-configuration interface 321, a sub-data interface 322, a storage module 323 and an arithmetic module 324.
  • the arithmetic module 324 includes the control circuit part shown in the left half of the box marked 324 in FIG. 9 and the arithmetic circuit part shown in the right half of the box.
  • the sub-configuration interface 321 is configured to receive configuration information sent by the configuration interface 310, the configuration information indicating the location of the region of interest.
  • the sub-data interface 322 is configured to receive the data of the input feature map sent by the data input interface 330.
  • the storage module 323 is configured to buffer the data received by the sub-data interface 322, and can also be used to buffer intermediate processing results.
  • the control circuit part in the arithmetic module 324 is configured to execute step S420 and step S430 described in the above embodiment.
  • the control circuit part in the arithmetic module 324 is configured to calculate the start coordinates (w_start_floor) and the end coordinates (w_end_ceil).
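One plausible reading of the names `w_start_floor` and `w_end_ceil` is the floor/ceil window partition commonly used for ROI pooling. The sketch below assumes that convention; the formula itself, the function name `window_bounds` and its parameters are assumptions for illustration and are not stated in this section.

```python
def window_bounds(roi_start, roi_size, out_size, index):
    """Return the half-open [start, end) range of one data window along one axis.

    Assumed convention: start = floor(index * roi_size / out_size),
    end = ceil((index + 1) * roi_size / out_size), both offset by roi_start.
    Integer arithmetic is used so that no floating-point rounding occurs.
    """
    w_start_floor = roi_start + (index * roi_size) // out_size
    w_end_ceil = roi_start + -(-((index + 1) * roi_size) // out_size)
    return w_start_floor, w_end_ceil


# A region of interest 5 pixels wide pooled to 2 output columns:
# the two windows share column 2.
assert window_bounds(0, 5, 2, 0) == (0, 3)
assert window_bounds(0, 5, 2, 1) == (2, 5)
```

Under this convention adjacent windows can overlap whenever the region size is not a multiple of the output size, which is the situation the buffering of intermediate results is meant to exploit.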
  • the arithmetic circuit part in the arithmetic module 324 is configured to execute step S440 described in the above embodiment.
  • the arithmetic circuit part includes a plurality of arithmetic circuits.
  • the arithmetic circuit is configured to have a comparison function.
  • the arithmetic circuit has one or more input terminals and an output terminal.
  • the arithmetic circuit can compare the data input at the input terminal to obtain the maximum value, and output the maximum value to the output terminal.
  • the arithmetic circuit can be realized by a circuit composed of a comparison circuit or a comparison operator.
  • the arithmetic circuit is configured to have the functions of calculating accumulation and averaging.
  • the arithmetic circuit has one or more input terminals and an output terminal.
  • the arithmetic circuit can accumulate the data input at the input terminal to obtain the accumulated sum, and output the accumulated sum to the output terminal.
  • the arithmetic circuit can also perform an averaging operation on the accumulated sum to obtain the average value, and output the average value to the output terminal.
  • the arithmetic circuit can be realized by an adder and a multiplier.
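An adder and a multiplier are enough for averaging if the division is replaced by multiplication with a precomputed reciprocal. The following sketch assumes a fixed-point reciprocal with 16 fractional bits; that choice, the rounding, and the name `average_fixed_point` are illustrative assumptions rather than details from the application.

```python
def average_fixed_point(values, frac_bits=16):
    """Average integers using only addition, multiplication and a shift."""
    # Adder: accumulate the inputs.
    total = 0
    for value in values:
        total += value
    # The reciprocal of the element count would be precomputed in hardware.
    reciprocal = round((1 << frac_bits) / len(values))
    # Multiplier: scale the sum, then round and drop the fractional bits.
    return (total * reciprocal + (1 << (frac_bits - 1))) >> frac_bits


assert average_fixed_point([10, 20, 30, 40]) == 25
```

The result is an approximation of the true mean whose accuracy depends on `frac_bits`; a real design would size it against the largest window and data width.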
  • the computing device 300 may include a data input interface 330.
  • the configuration interface 310 is also configured to transmit configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map to the data input interface 330.
  • the data input interface 330 is configured to read the data of the input feature map from an external storage device according to the starting position and the resolution of the input feature map, and broadcast the read data of the input feature map to the N computing units 320.
  • the computing device 300 may be a system on a chip. It should be understood that the storage resources of the system-on-chip are generally small, and the data to be processed generally needs to be obtained from an external storage device. In this application, the computing device 300 may obtain the input data of the ROI-Pooling layer, that is, the data of the input feature map from an external storage device.
  • the external storage device may be Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM).
  • DDR SDRAM can be referred to as DDR memory or DDR for short. It should be understood that this article does not limit the implementation of the external storage device.
  • the data input interface 330 may be configured to read data according to the data storage format of the external storage device.
  • the storage format of the data of the input feature map in the external storage device is that each row of input feature data is stored in an X-bit aligned manner, and X is a multiple of 8.
  • the data input interface 330 is configured to read input feature data from an external storage device in bursts of X bits × L, where L is a positive integer. For example, the value of X is 128.
  • each row of data is stored in a 128-bit (bit) aligned manner.
  • if the quantization bit width of the data is 8 bits, every 16 data values are aligned and stored at one address of the external storage device, and any shortfall at the end of a row is filled with invalid data to ensure that the starting address of the next row is 128-bit aligned.
  • the data input interface 330 is configured to read the data in the external storage device in bursts of 128 bits × L, that is, to access the external storage device in burst mode, with each access at a granularity of 128 bits.
  • L is a positive integer.
  • L is a positive integer less than or equal to 8, that is, the data input interface 330 accesses data of 8 addresses at most in each burst.
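The alignment and burst arithmetic described above can be worked through in code. This sketch assumes 8-bit data, 128-bit alignment and a maximum burst length of 8 addresses, as in the examples; the helper names `row_layout` and `plan_bursts` are hypothetical, and how bursts relate to row boundaries is not specified in the source.

```python
def row_layout(width_px, bits_per_px=8, align_bits=128):
    """Return (addresses per row, padding pixels at the end of the row)."""
    row_bits = width_px * bits_per_px
    words = -(-row_bits // align_bits)  # ceiling division
    padding_px = (words * align_bits - row_bits) // bits_per_px
    return words, padding_px


def plan_bursts(words, max_burst_len=8):
    """Split a run of aligned addresses into bursts of at most max_burst_len."""
    bursts = []
    remaining = words
    while remaining > 0:
        length = min(remaining, max_burst_len)
        bursts.append(length)
        remaining -= length
    return bursts


# A 40-pixel row occupies 3 addresses, the last one half filled with padding.
assert row_layout(40) == (3, 8)
assert plan_bursts(19) == [8, 8, 3]
```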
  • the manner in which the computing device provided in the present application reads data from the external storage device can be adapted to the data storage format of the external storage device, thereby improving the efficiency of data reading.
  • the input feature map is three-dimensional, for example, the input feature map is an input cube as shown in FIG. 1, and the data input interface 330 reads the input feature map in sequence, one by one.
  • the computing device 300 further includes a cache unit (not shown in FIG. 3).
  • the data input interface 330 is configured to: read the data of the input feature map in parallel from the external storage device in row-major order; cache the data of the input feature map read in parallel in the cache unit; perform parallel-to-serial conversion on the data of the input feature map in the cache unit; and broadcast the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units 320.
  • reading the data of the input feature map in parallel from the external storage device in row-major order means that the data of the input feature map is read in parallel from the external storage device at row granularity.
  • the data input interface 330 may be configured to read data in a zigzag scanning manner for an input feature map.
  • the data input interface 330 may broadcast the data of the input feature map obtained by the parallel-serial conversion process to the N computing units 320 in raster order.
  • the data input interface 330 adopts a buffer unit to buffer the data of the input feature map. While realizing data buffering, it can also solve the problem of internal and external data processing speed mismatch.
  • the buffer unit may be located in the data input interface 330.
  • the buffer unit may be a first-in, first-out (FIFO) module.
  • the FIFO module can be a FIFO memory or a FIFO queue.
  • the data bit width of the FIFO module can be designed according to the data storage format of the external storage device.
  • each row of data is stored in a 128-bit (bit) alignment manner, and the data bit width in the FIFO module is 128 bits.
  • the data input interface 330 can discard the padded invalid data while performing parallel-to-serial conversion on the data stored in the buffer unit.
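The FIFO buffering and parallel-to-serial conversion can be modelled in software as follows. The sketch assumes that each FIFO entry is one 128-bit word holding sixteen 8-bit pixels and that every row starts on a word boundary; the generator name `serialize_rows` and the list-based word model are illustrative assumptions.

```python
from collections import deque

PIXELS_PER_WORD = 16  # one 128-bit word holds sixteen 8-bit pixels


def serialize_rows(fifo, width_px, height):
    """Yield valid pixels in raster order, skipping end-of-row padding."""
    words_per_row = -(-width_px // PIXELS_PER_WORD)  # ceiling division
    for _ in range(height):
        emitted = 0
        for _ in range(words_per_row):
            word = fifo.popleft()  # parallel read of one buffered word
            for pixel in word:     # serial output, one pixel at a time
                if emitted < width_px:
                    emitted += 1
                    yield pixel
                # Pixels beyond width_px are alignment padding and are dropped.


# One row of 18 pixels spans two words; the last 14 slots are padding (-1).
row = list(range(18)) + [-1] * 14
fifo = deque([row[:16], row[16:]])
assert list(serialize_rows(fifo, width_px=18, height=1)) == list(range(18))
```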
  • the buffer unit can be provided separately from the data input interface 330, or can be integrated.
  • the number of computing units 320 included in the computing device 300 may be related to the granularity of the data read by the data input interface 330 and the number of pixels processed by the computing unit 320 in each clock cycle.
  • for example, the number of computing units 320 included in the computing device 300 can be set to 16, that is, the value of S can be 16.
  • if each computing unit 320 processes 1 pixel in each clock cycle, and the granularity of the data read by the data input interface 330 is 128 bits × 8 (assuming the quantization bit width of the data is 8 bits), then the number of computing units 320 included in the computing device 300 may be set to 128, that is, the value of S may be 128.
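The 128-unit example is consistent with dividing the read granularity by the quantization bit width and by the number of pixels each unit handles per cycle. The sketch below encodes that inferred relation; it is an assumption drawn from the example, not a formula given in the application, and the name `num_computing_units` is hypothetical. The same rule would yield 16 for a single 128-bit read, although the condition for that case is not stated in this section.

```python
def num_computing_units(read_granularity_bits, quant_bits=8, pixels_per_cycle=1):
    """Inferred sizing rule: units needed to keep pace with one read."""
    pixels_per_read = read_granularity_bits // quant_bits
    return pixels_per_read // pixels_per_cycle


assert num_computing_units(128 * 8) == 128  # the 128 bits x 8 example
assert num_computing_units(128) == 16       # a single 128-bit read
```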
  • the number of computing units 320 included in the computing device 300 may be determined according to actual requirements. For example, in practical applications, factors such as application performance requirements and resource occupation can be comprehensively considered to determine the number of computing units 320 included in the computing device 300.
  • the occupation of resources includes any one or more of the following: occupation of storage space (memory), and volume of the device.
  • the computing units 320 of the computing device 300 are tailorable. In other words, the number of computing units 320 in the computing device 300 can be dynamically changed.
  • when there are fewer regions of interest to be processed, the computing device 300 can be set to have fewer computing units 320; when there are many regions of interest to be processed, the computing device 300 can be set to have more computing units 320.
  • because its computing units 320 are tailorable, the computing device 300 can be flexibly adapted to ROI-pooling layers with different computing requirements.
  • the more computing units 320 included in the computing device 300, the larger the storage space required by the computing device 300 and the larger its overall volume; the fewer computing units 320 included, the smaller the storage space required and the smaller the overall volume. Therefore, since the computing units 320 of the computing device 300 provided in the present application are tailorable, the number of computing units 320 can be flexibly adjusted based on computing requirements, which can effectively reduce resource occupation while ensuring computing performance.
  • the number S of computing units 320 included in the computing device 300 may also be one.
  • the computing device 300 further includes a data output interface 340 configured to output the output data calculated by the N computing units 320 to an external storage device.
  • the storage format of the data of the input feature map in the external storage device is that each row of input feature data is stored in an X-bit aligned manner, and X is a multiple of 8.
  • the data output interface 340 is configured to splice the output data of the S computing units into X bits for aligned buffering, and output the aligned, buffered data to an external storage device.
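Splicing outputs into X-bit aligned words can be sketched as simple packing. The example assumes X = 128, 8-bit output values and zero padding for a partially filled final word; the padding value and the name `splice_outputs` are assumptions made for illustration.

```python
def splice_outputs(values, bits_per_value=8, align_bits=128, pad=0):
    """Group output values into aligned words, padding the last word."""
    per_word = align_bits // bits_per_value
    words = []
    for i in range(0, len(values), per_word):
        chunk = list(values[i:i + per_word])
        chunk += [pad] * (per_word - len(chunk))  # fill up to the alignment
        words.append(chunk)
    return words


words = splice_outputs(list(range(20)))
assert len(words) == 2
assert words[1] == [16, 17, 18, 19] + [0] * 12
```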
  • N is an integer greater than 1.
  • the computing device 300 may further include an arbitration unit 350 configured to sequentially transmit the output data calculated by the N computing units 320 to the data output interface 340 in a preset order.
  • the arbitration unit 350 may be configured to transmit the aligned data to the data output interface 340 using a fair polling algorithm.
  • using the arbitration unit 350 to transmit the output data to the data output interface 340 in a preset order facilitates management of the data in subsequent processing.
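A fair polling scheme is often realized as round-robin arbitration, which is the interpretation assumed in this sketch. Each computing unit is modelled as a queue of pending outputs; the function name `round_robin_drain` and the queue model are hypothetical.

```python
from collections import deque


def round_robin_drain(queues):
    """Visit the units in a fixed cyclic order, taking one item per visit."""
    pending = [deque(q) for q in queues]
    order = []
    while any(pending):  # an empty deque is falsy
        for unit, queue in enumerate(pending):
            if queue:
                order.append((unit, queue.popleft()))
    return order


result = round_robin_drain([["a0", "a1"], ["b0"], ["c0", "c1", "c2"]])
assert result == [
    (0, "a0"), (1, "b0"), (2, "c0"),
    (0, "a1"), (2, "c1"),
    (2, "c2"),
]
```

Because the visiting order is fixed, the position of each unit's data in the output stream is predictable, which is one way a preset order can simplify downstream handling.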
  • the configuration interface 310 is configured to, after the computing device 300 completes the pooling process of the N regions of interest, transmit data indicating the positions of P regions of interest to P computing units 320 of the S computing units 320, where the P regions of interest correspond to the P computing units 320 one-to-one, and P is a positive integer less than or equal to S.
  • the P regions of interest are regions of interest on the current input feature map that have not been pooled.
  • the P regions of interest may be regions of interest on the next input feature map.
  • the P regions of interest can be regions of interest on the current input feature map that have not been pooled, or alternatively regions of interest on the next input feature map.
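Dispatching regions of interest to computing units round after round can be sketched as simple batching. The example assumes that a new batch is configured only after the previous one finishes and that regions are assigned to units in order; the name `schedule_rois` and the in-order assignment are assumptions.

```python
def schedule_rois(rois, num_units):
    """Split regions of interest into rounds of at most num_units each.

    Each round is a list of (unit_index, roi) pairs, one region per unit.
    """
    rounds = []
    for start in range(0, len(rois), num_units):
        batch = rois[start:start + num_units]
        rounds.append(list(enumerate(batch)))
    return rounds


rounds = schedule_rois(["r0", "r1", "r2", "r3", "r4"], num_units=2)
assert rounds == [
    [(0, "r0"), (1, "r1")],
    [(0, "r2"), (1, "r3")],
    [(0, "r4")],  # final round uses fewer units than are available
]
```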
  • the computing device provided by the present application can support parallel pooling processing of multiple regions of interest by including multiple computing units, and therefore, can improve the processing efficiency of the ROI-pooling layer.
  • all resolutions and sizes of images or regions mentioned in this document are in units of pixels.
  • for example, the resolution of the input feature map is 8 × 8 pixels, the resolution of the region of interest is 7 × 5 pixels, and the size of the output pooling frame is 2 × 2 pixels.
  • the computing device 300 provided in this application may be an application specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
  • the computing device 300 provided in this application can be applied to implement hardware acceleration of the ROI-pooling layer in a convolutional neural network (CNN).
  • the computing device 300 can be applied to an intellectual property (IP) core and to a cooperative working circuit between IP cores.
  • the computing device 300 may also be applied to other types of neural network accelerators or processors that include the ROI-pooling layer.
  • FIG. 11 is a schematic flowchart of a method for calculating an ROI-pooling layer provided by an embodiment of the application.
  • the calculation method may be executed by the calculation device 300 of the above embodiment.
  • the calculation method includes the following steps S1110 and S1120.
  • S1120 Perform parallel pooling processing on the N regions of interest according to the configuration information to obtain output data of the corresponding regions of interest.
  • in step S1120, pooling is performed on N regions of interest, wherein the method for pooling the first region of interest includes steps S410 to S440 as described above. For brevity, details are not repeated here. It should be understood that the first region of interest represents each of the N regions of interest.
  • step S1120 combined with step S440 described above may specifically include: obtaining the column processing result of each row of data falling in the data window area; performing row processing on the column processing result to obtain output data in the data window area.
  • step S1120 combined with step S440 described above may specifically include: if the first region of interest includes a first data window area and a second data window area with overlapping rows, in the process of obtaining the output data of the first data window area, cache the first column processing result of the row-overlap area; in the process of calculating the output data of the second data window area, perform column processing on the row data in the second data window area other than the row-overlap area to obtain the second column processing result, and perform row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.
  • step S1120 combined with step S440 described above may specifically include: if the first region of interest includes a first data window area and a second data window area with overlapping rows, in the process of obtaining the output data of the first data window area, cache the row processing result of the first column processing result of the row-overlap area; in the process of calculating the output data of the second data window area, perform column processing on the row data in the second data window area other than the row-overlap area to obtain the second column processing result, and perform row processing on the second column processing result together with the cached row processing result of the first column processing result of the row-overlap area, to obtain the output data of the second data window area.
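The reuse of cached results for overlapping rows can be illustrated with two vertically adjacent data windows. The sketch assumes max pooling, windows given as half-open row ranges over a strip of columns, and a per-row reduction as the cached intermediate result; the name `pool_overlapping_windows` and the returned reduction count are illustrative additions.

```python
def pool_overlapping_windows(strip, win1, win2):
    """Pool two row ranges of a column strip, reusing shared row results.

    Returns (output of window 1, output of window 2, rows actually reduced).
    """
    # Window 1: reduce every row once and keep the per-row results.
    first = {r: max(strip[r]) for r in range(*win1)}
    out1 = max(first.values())
    # Cache only the results for rows that window 2 also covers.
    overlap = set(range(*win1)) & set(range(*win2))
    cached = {r: first[r] for r in overlap}
    # Window 2: reduce only the rows that were not cached.
    fresh = {r: max(strip[r]) for r in range(*win2) if r not in cached}
    out2 = max(list(cached.values()) + list(fresh.values()))
    return out1, out2, len(first) + len(fresh)


strip = [[1, 2], [3, 0], [4, 4], [9, 1], [2, 2], [0, 5], [6, 3]]
# Rows 0-3 and rows 3-6 share row 3, so 7 row reductions suffice instead of 8.
assert pool_overlapping_windows(strip, (0, 4), (3, 7)) == (9, 9, 7)
```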
  • step S1120 includes: performing parallel pooling processing on N regions of interest by N computing units in a computing device including S computing units, where the N computing units correspond one-to-one to the N regions of interest, S is an integer greater than 1, and N is an integer less than S.
  • in addition to using N computing units to perform parallel pooling processing on N regions of interest at the hardware level, parallel pooling can also be performed at the software level to improve the processing efficiency of the regions of interest, which is not specifically limited here.
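A software-level counterpart can be sketched with a thread pool in which each task pools one region of interest. This is an illustration only: it assumes max pooling, the floor/ceil window convention common in ROI pooling, and regions given as `(x0, y0, width, height)`; all function names are hypothetical, and pure-Python threads show the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of interest to an out_h x out_w grid."""
    x0, y0, w, h = roi
    out = []
    for i in range(out_h):
        r0 = y0 + (i * h) // out_h
        r1 = y0 + -(-((i + 1) * h) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            c0 = x0 + (j * w) // out_w
            c1 = x0 + -(-((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        out.append(row)
    return out


def pool_rois_in_parallel(feature_map, rois, out_h, out_w, workers=4):
    """Pool every region concurrently; results keep the order of rois."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda roi: roi_max_pool(feature_map, roi, out_h, out_w), rois))


# 8 x 8 feature map whose value at (row, col) is row * 8 + col.
fm = [[r * 8 + c for c in range(8)] for r in range(8)]
# One region 5 wide and 7 high, pooled to 2 x 2.
assert pool_rois_in_parallel(fm, [(1, 0, 5, 7)], 2, 2) == [[[27, 29], [51, 53]]]
```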
  • the computing device is, for example, the computing device 300 in the above embodiment.
  • the N calculation units are, for example, the N calculation units 320 in the above embodiment.
  • the calculation unit processes one pixel in each clock cycle.
  • the calculation unit includes an arithmetic module for performing arithmetic processing on the data falling into the corresponding data window.
  • the number of arithmetic modules included in the calculation unit is related to the width of the pooled output box.
  • the operation module is, for example, the operation module 324 in the above embodiment.
  • the method further includes performing the following steps through the data input interface: obtaining configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map; according to the starting position and the resolution of the input feature map, reading the data of the input feature map from the external storage device, and broadcasting the read data of the input feature map to the N computing units.
  • the data input interface is, for example, the data input interface 330 in the above embodiment.
  • reading the data of the input feature map from the external storage device includes: reading the data of the input feature map in parallel from the external storage device in row-major order. Broadcasting the read data of the input feature map to the N computing units includes: buffering the data of the input feature map read in parallel in a cache unit; performing parallel-to-serial conversion on the data of the input feature map in the cache unit; and broadcasting the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units.
  • the caching unit and the computing unit are separated, and at the same time, the caching unit can be provided separately from the data input interface, or can be integrated.
  • the buffer unit may be located in the data input interface.
  • the number S of computing units included in the computing device is related to the granularity of the data read by the data input interface and the number of pixels processed by the computing unit in each clock cycle.
  • the calculation method further includes: outputting the output data of the N regions of interest to the external storage device through the data output interface.
  • the calculation method further includes: sequentially transmitting the output data of the N regions of interest to the data output interface in a preset order through the arbitration unit.
  • the data output interface is, for example, the data output interface 340 in the above embodiment
  • the arbitration unit is, for example, the arbitration unit 350 in the above embodiment.
  • the calculation method further includes: acquiring configuration information indicating the positions of P regions of interest, where the P regions of interest are regions of interest on the current input feature map that have not been pooled, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
  • the P regions of interest are regions of interest on the current input feature map that have not been pooled.
  • the P regions of interest may be regions of interest on the next input feature map.
  • the P regions of interest can be regions of interest on the current input feature map that have not been pooled, or alternatively regions of interest on the next input feature map.
  • step S1120 is implemented by N computing units.
  • this step S1120 can also be implemented by software.
  • FIG. 12 is a schematic block diagram of a neural network system 1200 provided by an embodiment of the application.
  • the neural network system 1200 includes a region-of-interest-pooling layer computing device 1210, and the computing device 1210 is the computing device 300 in the above embodiment.
  • the neural network system 1200 may also include other neural network layer computing devices 1220.
  • the computing device 1220 includes any one or more of the following computing devices: a computing device in a convolutional layer, a computing device in an activation layer, a computing device in a pooling layer, and a computing device in a fully connected layer.
  • the computing devices mentioned herein can also be referred to as hardware accelerators.
  • for the beneficial effects of the calculation method of the ROI-pooling layer and of the neural network system provided herein, refer to the description of the computing device of the region of interest-pooling layer in the above embodiment; details are not repeated here.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
  • the disclosed system, device, and method may be implemented in other ways.
  • the device embodiments described above are merely illustrative, for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

Abstract

A region of interest (ROI)-pooling layer computation method and device, and a neural network system. The computation device comprises a configuration interface and S computation units. The configuration interface is configured to transmit, to N computation units among the S computation units, configuration information indicating positions of N ROIs, wherein the N ROIs respectively correspond to the N computation units. Each of the N computation units is configured to perform pooling processing on the ROI corresponding thereto so as to acquire output data of the corresponding ROI. The computation device comprises multiple computation units, thereby allowing parallel pooling processing to be performed on multiple ROIs, and improving processing efficiency of a ROI-pooling layer without consuming too much power.

Description

Region of interest-pooling layer calculation method and device, and neural network system
Copyright statement
The content disclosed in this patent document contains copyrighted material. The copyright belongs to the copyright owner. The copyright owner does not object to anyone copying the patent document or the patent disclosure as it appears in the official records and archives of the Patent and Trademark Office.
Technical field
This application relates to the field of data processing, and more specifically, to a calculation method and device for a region of interest-pooling layer, and a neural network system.
Background
At present, research on artificial intelligence (AI) has made rapid progress. In particular, the accuracy of the convolutional neural network (CNN) in fields such as image classification and detection is much higher than that of traditional machine vision algorithms. A CNN is composed of several predefined basic layers, including the convolutional layer, activation layer, pooling layer, and fully connected layer, where the pooling layer can include the region of interest (ROI)-pooling layer (ROI-pooling layer).
In the current technology, the data processing of the region of interest-pooling layer is implemented by a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform. The region of interest-pooling layer is computationally intensive. The computing throughput of the CPU computing platform is not high and cannot meet the computing performance requirements of the region of interest-pooling layer. The power consumption of the GPU computing platform is too high. It can be seen that traditional CPU or GPU computing solutions cannot achieve a balance between computing performance and power consumption.
Therefore, a processing solution for the region of interest-pooling layer with low power consumption is needed.
Summary of the invention
The present application provides a calculation method and device for a region of interest-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the region of interest-pooling layer without causing large power consumption.
A first aspect provides a computing device for a region of interest-pooling layer. The computing device includes a configuration interface and S computing units, where S is an integer greater than 1. The configuration interface is configured to transmit configuration information indicating the positions of N regions of interest to N computing units of the S computing units, where the N regions of interest correspond to the N computing units one-to-one, and N is a positive integer less than or equal to S. Each of the N computing units is configured to perform pooling processing on the region of interest corresponding to it to obtain output data of the corresponding region of interest.
A second aspect provides a calculation method for a region of interest-pooling layer. The calculation method includes: obtaining configuration information indicating the positions of N regions of interest, where N is a positive integer; and, according to the configuration information, performing parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest.
A third aspect provides a neural network system, which includes the region of interest-pooling layer computing device of the first aspect.
The computing device provided by the present application includes multiple computing units, which can support parallel pooling processing of multiple regions of interest, and therefore can improve the processing efficiency of the region of interest-pooling layer.
Description of the drawings
FIG. 1 is a schematic diagram of the function of the region of interest-pooling layer.
FIG. 2 is a schematic diagram of region of interest-pooling.
FIG. 3 is a schematic block diagram of a computing device according to an embodiment of the present application.
FIG. 4 is a schematic flowchart of obtaining the output data of a data window area in an embodiment of the present application.
FIG. 5 is another schematic diagram of region of interest-pooling.
FIG. 6 is yet another schematic diagram of region of interest-pooling.
FIG. 7 is still another schematic diagram of region of interest-pooling.
FIG. 8 is a schematic block diagram of a computing unit according to an embodiment of the present application.
FIG. 9 is another schematic block diagram of a computing unit according to an embodiment of the present application.
FIG. 10 is another schematic block diagram of a computing device according to an embodiment of the present application.
FIG. 11 is a schematic flowchart of a calculation method for a region of interest-pooling layer according to an embodiment of the present application.
FIG. 12 is a schematic block diagram of a neural network system according to an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application.
To aid understanding of the embodiments of the present application, the concepts related to the region of interest-pooling layer (hereinafter, the ROI-pooling layer) are introduced first.
As shown in FIG. 1, the function of the ROI-pooling layer is to downsample the regions of interest (ROIs) in a feature map.
The input data (input feature map, IFM) of the ROI-pooling layer is the output of the previous layer. It can be an array composed of one feature map or a 3D array composed of multiple feature maps. As shown in FIG. 1, the input data of the ROI-pooling layer is L feature maps, each with a resolution of H (height) × W (width).
The output data (output feature map, OFM) of the ROI-pooling layer is composed of several cubes. As shown on the right side of FIG. 1, there are M cubes in total. The number of cubes output by the ROI-pooling layer is determined by the number of regions of interest (ROIs) in the input feature map.
All cubes have the same dimensions. For example, in FIG. 1, each cube is composed of L output feature maps, and every output feature map in a cube has the same resolution, E (height) × F (width).
The function of the ROI-pooling layer is to downsample the regions of interest in the input feature map. For example, in FIG. 1, taking one of the L feature maps as an example, the ROI-pooling layer downsamples the input feature map with a resolution of H×W into an output feature map with a resolution of E×F.
The resolution of the feature map output by the ROI-pooling layer may be predefined. For example, in FIG. 1, the resolution E×F of the output cube may be predefined.
The mapping used by the pooling processing of the ROI-pooling layer may also be predefined; it is generally either taking the maximum (max) or taking the average (avg).
It can be seen that the ROI-pooling layer is characterized in that the size of the regions of interest to be pooled need not be fixed, whereas the size of the output feature map corresponding to each region of interest is fixed.
As an example and not a limitation, in FIG. 1 the calculation process of the ROI-pooling layer is as follows: according to the resolution E×F of the output cube and the position of the region of interest in the input feature map, the data window area on the input feature map corresponding to each piece of output data is derived backward point by point; the data in that data window area is then processed to obtain the output data corresponding to the data window area. The processing here can be taking the maximum value or taking the average value.
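The calculation process described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of this application: the function name `roi_pool`, its parameters, and in particular the floor/ceil rule used to derive each data window area are assumptions made for the example, since the text does not fix a partitioning rule.

```python
import math

def roi_pool(feature_map, top, left, height, width, out_h, out_w, mode="max"):
    """Pool one region of interest of a 2D feature map down to out_h x out_w.

    feature_map is a list of rows; (top, left) is the top-left pixel of the
    region of interest and (height, width) is its size. All names are
    hypothetical.
    """
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    output = []
    for i in range(out_h):
        # Assumed partitioning rule: floor/ceil bin edges inside the ROI.
        r0 = top + math.floor(i * height / out_h)
        r1 = top + math.ceil((i + 1) * height / out_h)
        out_row = []
        for j in range(out_w):
            c0 = left + math.floor(j * width / out_w)
            c1 = left + math.ceil((j + 1) * width / out_w)
            # Data falling into the data window area of output point (i, j).
            window = [feature_map[r][c]
                      for r in range(r0, r1) for c in range(c0, c1)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        output.append(out_row)
    return output

fm = [[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]
print(roi_pool(fm, 0, 0, 4, 4, 2, 2))  # prints [[5, 7], [13, 15]]
```

Whatever the size of the region of interest, the result always has the predefined resolution `out_h` × `out_w`, which is the property noted above.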
For ease of understanding and description, and not as a limitation, the concepts and terms used in this application are explained first.
1. Region of interest
A region of interest is a region of the input feature map that is to be pooled (that is, downsampled).
2. Resolution of the pooled output frame
The resolution of the pooled output frame is the resolution of the feature map obtained after a region of interest is pooled. For example, the resolution of the pooled output frame is the resolution E×F of the output cube shown in FIG. 1.
The pooled output frame may also be referred to as the feature map obtained after a region of interest is pooled.
Herein, the pixels of the feature map obtained after a region of interest is pooled are called output data. It should be understood that, if the resolution of the pooled output frame is E×F, one region of interest corresponds to (E*F) pieces of output data.
3. Data window area
A data window area is the area, within a region of interest, that corresponds to one piece of output data of that region of interest.
Taking one piece of output data of a region of interest as an example, that output data is obtained by pooling the data in a sub-region of the region of interest. This sub-region is referred to as a data window area.
The above concepts are illustrated below with the example of FIG. 2.
In FIG. 2, the resolution of the pooled output frame is 2×2, which means that the feature map obtained after any region of interest in the input feature map is pooled has a resolution of 2×2; in other words, any region of interest in the input feature map corresponds to 4 pieces of output data.
FIG. 2 shows a region of interest bounded by rows 4 to 8 and columns 1 to 7 of the input feature map. Assume that the 4 pieces of output data corresponding to this region of interest are C1, C2, C3, and C4, where the data window area corresponding to C1 on the region of interest is A1, that corresponding to C2 is A2, that corresponding to C3 is A3, and that corresponding to C4 is A4. That is, C1 is obtained by performing arithmetic processing (taking the maximum or the average) on the data in data window area A1, and C2, C3, and C4 are obtained in the same way.
It should be understood that FIG. 2 is only an example and not a limitation. In practical applications, one input feature map may include multiple regions of interest.
In the traditional technology, a central processing unit (CPU) computing platform or a graphics processing unit (GPU) computing platform is used to perform the calculation of the ROI-pooling layer. The computing efficiency of a CPU computing platform cannot meet the computing performance requirements of the ROI-pooling layer, and a GPU computing platform consumes a lot of power.
This application proposes a calculation method and device for the ROI-pooling layer, and a neural network system, which can effectively improve the calculation efficiency of the ROI-pooling layer without causing large power consumption.
FIG. 3 is a schematic block diagram of a computing device 300 for the ROI-pooling layer provided by this application.
As shown in FIG. 3, the computing device 300 includes a plurality of computing units 320, each of which is capable of performing pooling processing on a region of interest. It should be understood that, with the plurality of computing units 320, multiple regions of interest can be pooled in parallel. In other words, the computing device provided by this application can implement parallel data processing for the ROI-pooling layer and thereby improve the processing efficiency of the ROI-pooling layer.
As shown in FIG. 3, the computing device 300 further includes a configuration interface 310 configured to transmit, to N computing units 320, configuration information indicating the positions of N regions of interest, where the N regions of interest correspond one-to-one to the N computing units 320.
The configuration information may indicate the positions of the N regions of interest on the input feature map.
For example, the configuration information may indicate the coordinates of pixels in the N regions of interest.
Optionally, the position of a region of interest includes the coordinates of the first pixel at the upper left corner of the region of interest and the size of the region of interest.
Optionally, the position of a region of interest includes the coordinates of all pixels in the region of interest.
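The two optional encodings of the position of a region of interest carry the same information, which can be illustrated with a short Python sketch. The class `RoiConfig` and its field names are hypothetical and chosen only for this illustration; the text does not specify the actual format of the configuration information.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoiConfig:
    """First encoding: top-left pixel plus size (hypothetical field names)."""
    top: int     # row of the first pixel at the upper left corner
    left: int    # column of the first pixel at the upper left corner
    height: int  # number of rows in the region of interest
    width: int   # number of columns in the region of interest

    def pixel_coords(self):
        """Second encoding: the coordinates of all pixels in the region."""
        return [(r, c)
                for r in range(self.top, self.top + self.height)
                for c in range(self.left, self.left + self.width)]

# The region of interest of FIG. 2: rows 4 to 8, columns 1 to 7.
roi = RoiConfig(top=4, left=1, height=5, width=7)
coords = roi.pixel_coords()
print(len(coords), coords[0], coords[-1])  # prints 35 (4, 1) (8, 7)
```

The first encoding is more compact, while the second lists every pixel explicitly; which one a real design uses is not stated here.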
The configuration interface 310 may be an advanced peripheral bus (APB) interface.
Each of the N computing units 320 is configured to perform pooling processing on its corresponding region of interest to obtain the output data of that region of interest.
For example, each of the N computing units 320 is configured to determine its corresponding region of interest according to the received configuration information, and to perform pooling processing on the data in that region of interest to obtain the output data of the region of interest.
Optionally, the N computing units 320 are all of the computing units of the computing device 300, that is, all computing units 320 of the computing device 300 participate in the pooling processing.
Optionally, the N computing units 320 are some of the computing units of the computing device 300, that is, only some of the computing units 320 of the computing device 300 participate in the pooling processing.
For example, if the total number of computing units included in the computing device 300 is denoted S, then N is a positive integer less than or equal to S. For example, S is an integer greater than 1.
It should be understood that, in practical applications, whether some or all of the computing units in the computing device 300 participate in the operation can be determined according to application requirements.
Optionally, N is an integer greater than 1.
It should be understood that, in this embodiment, more than one computing unit 320 can pool more than one region of interest at the same time; that is, parallel data processing of the ROI-pooling layer can be implemented, which improves the processing efficiency of the ROI-pooling layer.
It should be understood that N may also be equal to 1. For example, if only one region of interest needs to be pooled in the current computing task, N is set to 1.
It should also be understood that, regardless of the value of N, the computing device provided in this application can support parallel pooling processing of multiple regions of interest and can therefore improve the processing efficiency of the ROI-pooling layer.
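The relationship between S, N, and the one-to-one assignment of regions of interest to computing units can be sketched in software as follows. This is only an analogy under stated assumptions: the computing units 320 are hardware units, whereas the sketch uses a Python thread pool; the value of `S`, the function names, and the simplification that each unit reduces its region of interest to a single maximum (a 1×1 pooled output frame) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

S = 4  # assumed total number of computing units

def pool_roi(feature_map, roi):
    """Stand-in for one computing unit: max over one region of interest."""
    top, left, height, width = roi
    return max(feature_map[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))

def pool_rois_in_parallel(feature_map, rois, num_units=S):
    """Assign N regions of interest to N of the num_units units, one each."""
    n = len(rois)
    if not 1 <= n <= num_units:
        raise ValueError("N must satisfy 1 <= N <= S")
    with ThreadPoolExecutor(max_workers=num_units) as units:
        # map() preserves order, so result k belongs to region of interest k.
        return list(units.map(lambda roi: pool_roi(feature_map, roi), rois))

fm = [[0, 1, 2, 3],
      [4, 5, 6, 7],
      [8, 9, 10, 11],
      [12, 13, 14, 15]]
print(pool_rois_in_parallel(fm, [(0, 0, 2, 2), (2, 2, 2, 2)]))  # prints [5, 15]
```

Here N = 2 of the S = 4 units are used; passing more than S regions of interest in one call is rejected, mirroring the constraint that N is at most S.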
It should be understood that all of the N computing units 320 have the same function, that is, each computing unit 320 pools its region of interest in a similar way. For ease of understanding and description, the function and operation of the computing unit 320 are described herein taking a first computing unit 320 of the N computing units 320 as an example. The description of the first computing unit 320 applies correspondingly to each of the N computing units 320.
The first computing unit 320 is configured to perform pooling processing on a first region of interest to obtain the output data of the first region of interest. The first region of interest is the region of interest, among the N regions of interest, that corresponds to the first computing unit 320. The method by which the first computing unit 320 pools the first region of interest to obtain its output data includes steps S410 to S440, as shown in FIG. 4.
S410: Obtain the data of an input feature map, where the input feature map includes K regions of interest and K is a positive integer not less than N.
Still referring to FIG. 3, the computing device 300 further includes a data input interface 330 configured to read the data of the input feature map from an external storage device. The first computing unit 320 may obtain the data of the input feature map from the data input interface 330. For example, the data input interface 330 is configured to send the data of the input feature map to the first computing unit 320.
S420: According to the position of the first region of interest and the resolution of the pooled output frame, obtain the data window area on the first region of interest that corresponds to the data to be output for the first region of interest.
S430: Select, from the obtained data of the input feature map, the data that falls into the data window area.
S440: Perform arithmetic processing on the data falling into the data window area to obtain the output data of the data window area. The arithmetic processing may be taking the maximum value or taking the average value, and the specific processing may be predefined.
It should be understood that once the first computing unit 320 has obtained the output data of all data window areas of the first region of interest, it has obtained the output data of the first region of interest. If the resolution of the pooled output frame is E×F, the first region of interest corresponds to (E*F) pieces of output data.
The computing device provided in this application can implement parallel data processing of the ROI-pooling layer and, compared with the prior art, can effectively improve the processing efficiency of the ROI-pooling layer at lower power consumption.
It should be understood that the method of dividing a region of interest into multiple data window areas based on the resolution of the pooled output frame is prior art and is not described in detail herein.
Step S440 can be implemented in multiple ways.
Optionally, step S440 of performing arithmetic processing on the data falling into the data window area to obtain the output data of the data window area includes: obtaining a column-processing result for each row of data falling into the data window area; and performing row processing on the column-processing results to obtain the output data of the data window area.
It should be understood that if the operation of the pooling processing (also called the mapping relationship) is taking the maximum value, then both column processing and row processing take the maximum value. If the operation of the pooling processing is taking the average value, then column processing computes an accumulated sum, and row processing first computes an accumulated sum and then the average value.
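The split of step S440 into column processing and row processing can be sketched as follows. The function name `pool_window` and the representation of a data window area as a list of rows are assumptions made for this illustration.

```python
def pool_window(rows, mode="max"):
    """Reduce one data window area, given as a list of rows, in two stages."""
    if mode == "max":
        # Column processing: one maximum per row of the window.
        column_results = [max(row) for row in rows]
        # Row processing: maximum over the per-row results.
        return max(column_results)
    if mode == "avg":
        # Column processing: one accumulated sum per row of the window.
        column_results = [sum(row) for row in rows]
        # Row processing: accumulate the per-row sums, then take the average.
        count = sum(len(row) for row in rows)
        return sum(column_results) / count
    raise ValueError("mode must be 'max' or 'avg'")

window = [[1, 5],
          [4, 2]]
print(pool_window(window))         # prints 5
print(pool_window(window, "avg"))  # prints 3.0
```

Because each row is reduced to a single value before the rows are combined, the per-row results are natural intermediate values to buffer between the two stages.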
An example in which the first computing unit 320 pools a region of interest is described below with reference to FIG. 5. In FIG. 5, the operation of the pooling processing (also called the mapping relationship) is taking the maximum value, the resolution of the pooled output frame is 2×2, and the region of interest has a resolution of 8×4.
In the example of FIG. 5, the pooling of the region of interest by the first computing unit 320 includes steps 1-1 and 1-2 below.
Step 1-1: Based on the resolution 2×2 of the pooled output frame, determine the 4 data window areas in the region of interest, shown in FIG. 5 as data window area 1, data window area 2, data window area 3, and data window area 4.
It should be understood that step 1-1 corresponds to steps S420 and S430 in FIG. 4.
Step 1-2: Take the maximum of the data in each of the 4 data window areas to obtain the corresponding output data.
As shown in FIG. 5, taking the maximum of the data in data window area 1 yields output data 29; taking the maximum of the data in data window area 2 yields output data 31; taking the maximum of the data in data window area 3 yields output data 30; and taking the maximum of the data in data window area 4 yields output data 28.
Taking data window area 1 as an example, in step 1-2, taking the maximum of the data in data window area 1 to obtain its output data 29 may include the following sub-steps.
Sub-step 1-2-1: Obtain the column-processing result of each row of data falling into data window area 1.
For example, the maximum of the first row of data {12,3} is taken to obtain the column-processing result 12 of the first row; the maximum of the second row of data {29,26} is taken to obtain the column-processing result 29 of the second row; the maximum of the third row of data {2,11} is taken to obtain the column-processing result 11 of the third row; and the maximum of the fourth row of data {12,13} is taken to obtain the column-processing result 13 of the fourth row.
Sub-step 1-2-2: Perform row processing on the column-processing results of the rows of data window area 1 to obtain the output data 29 of data window area 1.
For example, the maximum of the column-processing results {12,29,11,13} of the 4 rows of data in data window area 1 is taken, giving the row-processing result 29, which is the output data 29 of data window area 1.
It should be understood that step 1-2 corresponds to step S440 in FIG. 4.
It should also be understood that, in the example shown in FIG. 5, the input is a feature map with a resolution of 8×4 and the output is a feature map with a resolution of 2×2.
在图5的示例中,不同数据窗口区域之间不具有重叠区域。有些情况下,不同数据窗口区域之间可能具有重叠区域,如图6所示。In the example of FIG. 5, there is no overlapping area between different data window areas. In some cases, there may be overlapping areas between different data window areas, as shown in Figure 6.
在图6中,池化处理的运算方式(也可称为映射关系)为求最大值,池化输出框的分辨率为2×2,感兴趣区域为分辨率为6×4。从图6可以看出,输出数据29与22分别对应的数据窗口区域1与数据窗口区域3之间具有重叠区域,输出数据31与23分别对应的数据窗口区域2与数据窗口区域4之间具有重叠区域。In FIG. 6, the calculation method of the pooling process (also referred to as the mapping relationship) is to find the maximum value, the resolution of the pooling output frame is 2×2, and the area of interest is the resolution of 6×4. It can be seen from Figure 6 that there is an overlap area between the data window area 1 and the data window area 3 corresponding to the output data 29 and 22, and there is an overlap area between the data window area 2 and the data window area 4 corresponding to the output data 31 and 23, respectively. Overlapping area.
应理解,图6仅为示例而非限定,实际应用中,两个数据窗口区域之间的重叠区域可以包括一行或多行数据,或者,可以包括一列或多列数据。It should be understood that FIG. 6 is only an example and not a limitation. In practical applications, the overlapping area between the two data window areas may include one or more rows of data, or may include one or more columns of data.
若两个数据窗口区域之间的重叠区域包括一行或多行数据,该重叠区域也可以称为行重叠区域。若两个数据窗口区域之间的重叠区域包括一列或多列数据,该重叠区域也可以称为列重叠区域。If the overlap area between the two data window areas includes one or more rows of data, the overlap area may also be referred to as a row overlap area. If the overlapping area between two data window areas includes one or more columns of data, the overlapping area may also be referred to as a column overlapping area.
可选地,步骤S440,对落入数据窗口区域的数据进行运算处理,获得数据窗口区域的输出数据,包括:若第一感兴趣区域包括具有行重叠区域的第一数据窗口区域和第二数据窗口区域,则在获取第一数据窗口区域的输出 数据的过程中,缓存行重叠区域的第一列处理结果;在计算第二数据窗口区域的输出数据的过程中,对第二数据窗口区域中除行重叠区域之外的行数据进行列处理,获得第二列处理结果,对第二列处理结果与所缓存的第一列处理结果进行行处理,获得第二数据窗口的输出数据。Optionally, step S440: performing arithmetic processing on the data falling in the data window area to obtain output data of the data window area, including: if the first area of interest includes the first data window area and the second data area with overlapping rows Window area, in the process of obtaining the output data of the first data window area, the processing result of the first column of the line overlapping area is cached; in the process of calculating the output data of the second data window area, the output data in the second data window area Column processing is performed on the row data except for the row overlap area to obtain a second column processing result, and row processing is performed on the second column processing result and the cached first column processing result to obtain the output data of the second data window.
例如,第一感兴趣区域如图6中所示的感兴趣区域,第一数据窗口区域如图6中所示的数据窗口区域1,第二数据窗口区域如图6中所示的数据窗口区域3。数据窗口区域1与数据窗口区域3具有行重叠区域,该行重叠区域包括两行数据{{2,11},{12,13}}。For example, the first area of interest is the area of interest shown in FIG. 6, the first data window area is the data window area 1 shown in FIG. 6, and the second data window area is the data window area shown in FIG. 6. 3. The data window area 1 and the data window area 3 have a row overlap area, and the row overlap area includes two rows of data {{2, 11}, {12, 13}}.
在图6的示例中,第一计算单元320获取落入数据窗口区域1的输出数据29的过程可以包括如下步骤2-1至步骤2-3。第一计算单元320获得落入数据窗口区域3的输出数据22的过程可以包括如下步骤3-1与步骤3-2。In the example of FIG. 6, the process of obtaining the output data 29 falling in the data window area 1 by the first calculation unit 320 may include the following steps 2-1 to 2-3. The process of obtaining the output data 22 falling into the data window area 3 by the first calculation unit 320 may include the following steps 3-1 and 3-2.
步骤2-1,对落入数据窗口区域1的每一行数据进行列处理,获得每一行数据的列处理结果。本例中,池化处理的运算方式为求最大值,相应地,列处理与行处理的运算方式均是求最大值。Step 2-1: Perform column processing on each row of data that falls into the data window area 1, and obtain the column processing result of each row of data. In this example, the calculation method of pooling processing is to find the maximum value. Accordingly, the calculation methods of column processing and row processing are both to find the maximum value.
参见图6,对第一行数据{12,3}求最大值,获得第一行数据的列处理结果12;对第二行数据{29,26}求最大值,获得第二行数据的列处理结果29;对第三行数据{2,11}求最大值,获得第三行数据的列处理结果11;对第四行数据{12,13}求最大值,获得第四行数据的列处理结果13。Referring to Figure 6, the maximum value of the first row of data {12,3} is obtained, and the column processing result 12 of the first row of data is obtained; the maximum value of the second row of data {29,26} is obtained to obtain the column of the second row of data Processing result 29; seeking the maximum value of the third row of data {2,11} to obtain the column processing result 11 of the third row of data; seeking the maximum value of the fourth row of data {12,13} to obtain the column of the fourth row of data Processing result 13.
步骤2-2,对步骤2-1获得的列处理结果进行行处理,获得行处理结果,即获得数据窗口区域1的输出数据。In step 2-2, row processing is performed on the column processing result obtained in step 2-1 to obtain the row processing result, that is, the output data of the data window area 1 is obtained.
参见图6,对步骤2-1获得的列处理结果{12,29,11,13}求最大值,获得行处理结果29,即获得数据窗口区域1的输出数据29。Referring to FIG. 6, the column processing result {12, 29, 11, 13} obtained in step 2-1 is maximized, and the row processing result 29 is obtained, that is, the output data 29 of the data window area 1 is obtained.
步骤2-3,缓存行重叠区域(即第3行与第4行)的列处理结果{11,13}。Step 2-3: Cache the column processing result {11,13} in the overlapping area of the line (that is, line 3 and line 4).
可选地,步骤2-3可以包含在步骤2-1或步骤2-2中。Optionally, step 2-3 may be included in step 2-1 or step 2-2.
步骤3-1,对落入数据窗口区域3中除行重叠区域之外的行数据进行列处理。Step 3-1: Perform column processing on the row data falling in the data window area 3 except for the row overlap area.
Referring to FIG. 6, the maximum of the third row of data {7,14} is taken, giving the column processing result 14 for the third row; the maximum of the fourth row of data {22,4} is taken, giving the column processing result 22 for the fourth row.
Step 3-2: perform row processing on the column processing results {14,22} obtained in step 3-1 and the column processing results {11,13} of the row overlap area cached in step 2-3, to obtain the row processing result, that is, the output data of data window area 3.
That is, the maximum of the column processing results {14,22} obtained in step 3-1 and the column processing results {11,13} of the row overlap area cached in step 2-3 is taken, giving the output data 22 of data window area 3.
In the above example described with reference to FIG. 6, when obtaining the output data of data window area 3, the first calculation unit 320 omits the read operations for the first row of data {2,11} and the second row of data {12,13} of data window area 3, and instead directly uses the column processing results of these two rows that were cached while obtaining the output data of data window area 1. As step 2-1 above shows, the two rows of data {2,11} and {12,13} in the row overlap area were already read while obtaining the output data of data window area 1. Therefore, where different data window areas overlap, this embodiment avoids reading the same data repeatedly.
It should be understood that, by avoiding repeated data reads, this embodiment saves bandwidth and also improves computing efficiency.
It should be understood that steps 2-1 to 2-3 and steps 3-1 to 3-2 above may be an optional implementation of step S440.
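As a non-limiting illustration, steps 2-3, 3-1 and 3-2 can be sketched in software for the max-pooling case. The sketch uses only the row values quoted above for data window area 3; the function names column_process and row_process are hypothetical and are not taken from this application.

```python
def column_process(row):
    """Column processing of one row: reduce across the columns of that row (max pooling)."""
    return max(row)


def row_process(column_results):
    """Row processing: reduce the per-row column processing results (max pooling)."""
    return max(column_results)


# Rows of data window area 3 as quoted in the FIG. 6 example.
overlap_rows = [[2, 11], [12, 13]]  # shared with data window area 1
new_rows = [[7, 14], [22, 4]]       # the only rows read for area 3 in step 3-1

# Step 2-3 (performed while area 1 is computed): cache the column results of the overlap rows.
cache = [column_process(row) for row in overlap_rows]

# Step 3-1: column processing of the rows outside the overlap area.
new_results = [column_process(row) for row in new_rows]

# Step 3-2: row processing over the new results plus the cached results.
output_area3 = row_process(new_results + cache)

print(cache, new_results, output_area3)  # [11, 13] [14, 22] 22
```

In this sketch, only the two rows in new_rows are read when area 3 is computed; the overlap rows contribute through the cache alone.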
Optionally, step S440, performing arithmetic processing on the data falling into the data window area to obtain the output data of the data window area, includes: if the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching, in the process of obtaining the output data of the first data window area, the row processing result of the first column processing results of the row overlap area; and, in the process of calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain second column processing results, and performing row processing on the second column processing results and the cached row processing result of the first column processing results of the row overlap area, to obtain the output data of the second data window area.
As an example, refer to FIG. 6. The first region of interest is the region of interest shown in FIG. 6, the first data window area is data window area 1 shown in FIG. 6, and the second data window area is data window area 3 shown in FIG. 6. Data window area 1 and data window area 3 share a row overlap area that includes two rows of data {{2,11},{12,13}}.
The process by which the first calculation unit 320 obtains the output data 29 of data window area 1 may include steps 4-1 to 4-3. The process by which the first calculation unit 320 obtains the output data 22 of data window area 3 may include steps 5-1 and 5-2.
Step 4-1 is the same as step 2-1 described above, and step 4-2 is the same as step 2-2 described above.
Step 4-3: cache the row processing result {13} of the column processing results {11,13} of the row overlap area (that is, the 3rd and 4th rows).
Optionally, step 4-3 may be included in step 4-2.
Step 5-1 is the same as step 3-1 described above.
Step 5-2: perform row processing on the column processing results {14,22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3, to obtain the row processing result, that is, the output data of data window area 3.
That is, the maximum of the column processing results {14,22} obtained in step 5-1 and the row processing result {13} of the row overlap area cached in step 4-3 is taken, giving the output data 22 of data window area 3.
In this embodiment, when obtaining the output data of data window area 3, the first calculation unit 320 omits the read operations for the first row of data {2,11} and the second row of data {12,13} of data window area 3, and instead directly uses the row processing result of these two rows that was cached while obtaining the output data of data window area 1.
As step 4-1 above shows, the two rows of data {2,11} and {12,13} in the row overlap area were already read while obtaining the output data of data window area 1. Therefore, where different data window areas overlap, this embodiment avoids reading the same data repeatedly.
In addition, step 4-3 of this embodiment only needs to cache the row processing result {13} of the overlap area (that is, the 3rd and 4th rows), rather than the column processing results {11,13}, which further reduces the cache requirement of the first calculation unit 320.
It should be understood that steps 4-1 to 4-3 and steps 5-1 to 5-2 above may be an optional implementation of step S440.
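As a non-limiting illustration, the variant of steps 4-3, 5-1 and 5-2 can be sketched as follows for max pooling. The variable names are hypothetical. The sketch relies on the fact that taking a maximum can be done in stages; it does not cover average pooling, where a different partial result would have to be cached.

```python
# Column processing results of the two overlap rows {2,11} and {12,13}, from step 4-2.
overlap_column_results = [11, 13]

# Step 4-3: reduce them to a single row processing result and cache only that value.
cached_row_result = max(overlap_column_results)

# Step 5-1: column processing of the rows of area 3 outside the overlap area.
new_column_results = [max([7, 14]), max([22, 4])]

# Step 5-2: row processing over the new column results and the single cached value.
output_area3 = max(new_column_results + [cached_row_result])

# Cache entries needed per overlap: one value here, versus one value per overlap row before.
entries_this_variant = 1
entries_previous_variant = len(overlap_column_results)

print(cached_row_result, output_area3)  # 13 22
```

The output is the same as when both column processing results are cached, while the cache holds one value instead of two.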
As another example, refer to FIG. 7. As shown in FIG. 7, the left side is a column of data in the region of interest, and the right side is a column of output data in the pooled output box (also called the output feature map) corresponding to that region of interest. Pixel a is calculated from the input data {1,2,3,4,5}, and pixel b is calculated from the input data {3,4,5,6,7}. When the input data is processed row by row, the calculation result of {3,4,5} is obtained while pixel a is being calculated. This result is an intermediate result of both pixel a and pixel b. In this case, the calculation result of {3,4,5} is cached while pixel a is being calculated; when pixel b is calculated, the cached result can be read directly, and pixel b is calculated from the cached result of {3,4,5} together with {6,7}.
It should be understood that, by avoiding repeated data reads, this embodiment saves bandwidth and improves computing efficiency, and can also reduce the cache requirement.
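As a non-limiting illustration of the FIG. 7 example, the following sketch treats the numbers 1 to 7 as sample data values (the text does not say whether they are values or position labels, so this is an assumption) and uses max pooling. The variable names are hypothetical.

```python
column = [1, 2, 3, 4, 5, 6, 7]  # one column of the region of interest (assumed values)

# Pixel a covers positions 0..4, pixel b covers positions 2..6; positions 2..4 are shared.
shared_result = max(column[2:5])          # computed once while pixel a is calculated, then cached

a = max(max(column[0:2]), shared_result)  # pixel a from {1,2} and the shared result
b = max(shared_result, max(column[5:7]))  # pixel b from the cached result and {6,7}

reads_with_cache = 7     # each input value is read once
reads_without_cache = 10 # 5 reads for a plus 5 reads for b

print(a, b)  # 5 7
```

The results match a direct computation over {1,2,3,4,5} and {3,4,5,6,7}, while the three shared inputs are read only once.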
As described above, in step S440 the data falling into the data window area is processed by column processing first and then row processing to obtain the output data of the data window area. This application is not limited thereto. For example, in step S440, the data falling into the data window area may also be processed by row processing first and then column processing to obtain the output data of the data window area.
Optionally, step S440 includes: obtaining the row processing result of each column of data falling into the data window area; and performing column processing on the obtained row processing results to obtain the output data of the data window area.
It should be understood that, in an embodiment in which the data falling into the data window area is processed by row processing first and then column processing, repeated data reads can likewise be avoided where different data window areas overlap. The implementation is similar to that described in the embodiments above and is not repeated here.
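As a non-limiting illustration, the following sketch checks on the four rows quoted for data window area 3 that both processing orders give the same output, for max pooling and for the accumulated sum used by average pooling. The variable names are hypothetical.

```python
window = [[2, 11], [12, 13], [7, 14], [22, 4]]  # rows of data window area 3 (FIG. 6 example)

# Column processing per row first, then row processing over the per-row results.
max_column_first = max(max(row) for row in window)

# Row processing per column first, then column processing over the per-column results.
max_row_first = max(max(col) for col in zip(*window))

# The same holds for the accumulated sum, from which an average is derived.
sum_column_first = sum(sum(row) for row in window)
sum_row_first = sum(sum(col) for col in zip(*window))
count = sum(len(row) for row in window)

print(max_column_first, max_row_first)                  # 22 22
print(sum_column_first / count, sum_row_first / count)  # 10.625 10.625
```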
Optionally, as shown in FIG. 8, the first calculation unit 320 may include one or more arithmetic modules 324. Each arithmetic module 324 is configured to perform arithmetic processing on the data falling into a data window area to obtain the output data of that data window area.
For example, step S440 in the embodiments above is executed by the arithmetic module 324 in the first calculation unit 320.
Optionally, the first calculation unit 320 includes a plurality of arithmetic modules 324, each of which is configured to obtain one piece of output data.
Optionally, the number of arithmetic modules 324 included in the first calculation unit 320 may be related to the width of the pooled output box.
For example, if the resolution of the pooled output box is E (height) × F (width), the number of arithmetic modules 324 in the first calculation unit 320 may be F.
It should be understood that the number of arithmetic modules in the first calculation unit 320 may also be determined according to actual requirements, which is not limited in this application. For example, in practical applications, factors such as the performance requirements of the application and resource occupation can be considered together to decide the number of arithmetic modules in the first calculation unit 320. Resource occupation includes any one or more of the following: occupation of storage space (memory) and the volume of the device.
Optionally, the first calculation unit 320 supports tailoring of its arithmetic modules. In other words, the number of arithmetic modules in the first calculation unit 320 can change dynamically.
Because the first calculation unit 320 provided in this application supports tailoring of its arithmetic modules, the number of arithmetic modules in the first calculation unit 320 can be adjusted flexibly according to computing requirements, which improves computing efficiency and the utilization of computing resources.
The first calculation unit 320 may further include a calculation control module configured to execute step S420 and step S430 in the embodiments above.
Optionally, the calculation control module and the arithmetic module 324 may be provided separately or integrated together. For example, the arithmetic module 324 includes the calculation control module.
Optionally, as shown in FIG. 8, the first calculation unit 320 may further include a sub-configuration interface 321, a sub-data interface 322, and a storage module 323.
The sub-configuration interface 321 is configured to receive configuration information from the configuration interface 310.
For example, both the configuration interface 310 and the sub-configuration interface 321 may be advanced peripheral bus (APB) interfaces.
The sub-data interface 322 is configured to receive the data of the input feature map from the data input interface 330.
The storage module 323 is configured to cache intermediate processing results.
It should be understood that the storage module 323 may be configured to cache intermediate processing data that the first calculation unit 320 needs to reuse in the process of obtaining the output data of the first region of interest.
As an example, in the embodiments described above in which different data window areas overlap, the storage module 323 may be configured to cache the column processing results or the row processing result of the data in the row overlap area shared by data window area 1 and data window area 3.
Optionally, the storage module 323 may be located in the arithmetic module 324.
Optionally, the storage module 323 may also be configured to store the data of the input feature map received by the sub-data interface 322.
For example, the data of the input feature map stored in the storage module 323 can support the first calculation unit 320 in pooling multiple regions of interest. That is, while pooling one or more regions of interest, the first calculation unit 320 can obtain the data of the input feature map directly from the storage module 323 without fetching it from outside.
FIG. 9 is another schematic block diagram of the first calculation unit 320. The first calculation unit 320 includes a sub-configuration interface 321, a sub-data interface 322, a storage module 323, and an arithmetic module 324. The arithmetic module 324 includes the control circuit part shown in the left half of the box labeled 324 in FIG. 9 and the arithmetic circuit part shown in the right half of that box.
The sub-configuration interface 321 is configured to receive the configuration information sent by the configuration interface 310, where the configuration information indicates the position of the region of interest.
The sub-data interface 322 is configured to receive the data of the input feature map sent by the data input interface 330.
The storage module 323 is configured to cache the data received by the sub-data interface 322, and may also be used to cache intermediate processing results.
The control circuit part of the arithmetic module 324 is configured to execute step S420 and step S430 described in the embodiments above.
For example, the control circuit part of the arithmetic module 324 is configured to calculate, based on the position of the region of interest and the resolution of the pooled output box, the start coordinate (w_start_floor) and the end coordinate (w_end_ceil) of the data window area currently to be calculated.
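As a non-limiting illustration, the start and end coordinates along one axis can be sketched as below. The formula shown is the conventional ROI pooling rule (round the start down and the end up) and is an assumption; this passage does not give the formula used in steps S420 and S430. The function name window_bounds and its parameters are hypothetical, and the end coordinate is treated as exclusive.

```python
def window_bounds(roi_start, roi_len, out_len, index):
    """Return (start_floor, end_ceil) of output bin `index` along one axis (end exclusive).

    Assumed rule: start = floor(index * roi_len / out_len),
                  end   = ceil((index + 1) * roi_len / out_len),
    both offset by the start coordinate of the region of interest.
    """
    start_floor = roi_start + (index * roi_len) // out_len
    end_ceil = roi_start + -((-(index + 1) * roi_len) // out_len)  # integer ceiling division
    return start_floor, end_ceil


# A region of interest 7 pixels long pooled into 2 outputs: the two windows share index 3.
bounds_len7 = [window_bounds(0, 7, 2, i) for i in range(2)]
# A region of interest 5 pixels long pooled into 2 outputs: the two windows share index 2.
bounds_len5 = [window_bounds(0, 5, 2, i) for i in range(2)]

print(bounds_len7)  # [(0, 4), (3, 7)]
print(bounds_len5)  # [(0, 3), (2, 5)]
```

Under this assumed rule, adjacent data window areas overlap whenever the length of the region of interest is not a multiple of the output length, which is the situation the cached overlap results address.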
The arithmetic circuit part of the arithmetic module 324 is configured to execute step S440 described in the embodiments above.
As shown in FIG. 9, the arithmetic circuit part includes a plurality of arithmetic circuits.
In a scenario where the pooling operation is taking the maximum, the arithmetic circuit is configured to have a comparison function. Taking one arithmetic circuit as an example, the arithmetic circuit has one or more input terminals and one output terminal; it compares the data at its input terminals to obtain the maximum value and outputs the maximum value at its output terminal. For example, the arithmetic circuit may be implemented by a comparison circuit or by a circuit built from comparators.
In a scenario where the pooling operation is taking the average, the arithmetic circuit is configured to have accumulation and averaging functions. Taking one arithmetic circuit as an example, the arithmetic circuit has one or more input terminals and one output terminal; it accumulates the data at its input terminals to obtain an accumulated sum and outputs the accumulated sum at its output terminal. The arithmetic circuit can also average the accumulated sum to obtain the average value and output the average value at its output terminal. For example, the arithmetic circuit may be implemented by an adder and a multiplier.
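As a non-limiting illustration, the two kinds of arithmetic circuit can be modeled in software as follows. The function names are hypothetical. Modeling the average as a multiplication by the reciprocal of the element count is one way an adder and a multiplier could be combined, and is an assumption rather than the circuit of this application; floating point is used here only for brevity.

```python
def max_circuit(inputs):
    """Model of a comparison-based circuit: keep the larger value at each comparison."""
    result = inputs[0]
    for value in inputs[1:]:
        if value > result:  # comparator
            result = value
    return result


def average_circuit(inputs):
    """Model of an adder plus multiplier: accumulate, then multiply by 1/count."""
    accumulated = 0
    for value in inputs:
        accumulated += value       # adder
    reciprocal = 1.0 / len(inputs) # assumed to be known in advance for the window size
    return accumulated * reciprocal  # multiplier


samples = [2, 11, 12, 13]
print(max_circuit(samples), average_circuit(samples))  # 13 9.5
```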
Continuing to refer to FIG. 3, the computing device 300 may include a data input interface 330. The configuration interface 310 is further configured to transmit to the data input interface 330 configuration information indicating the starting position of the input feature map in the external storage device and the resolution of the input feature map. The data input interface 330 is configured to: read the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map; and broadcast the read data of the input feature map to the N computing units 320.
It should be understood that, because the data input interface 330 broadcasts the read data of the input feature map to the N computing units 320, a single read operation on the external storage device allows all N computing units 320 to obtain the read data. Repeated data reads are therefore avoided and bandwidth is saved.
The computing device 300 may be a system on a chip. It should be understood that the storage resources of a system on a chip are usually limited, so the data to be processed generally has to be obtained from an external storage device. In this application, the computing device 300 may obtain the input data of the ROI-pooling layer, that is, the data of the input feature map, from the external storage device.
The external storage device may be a double data rate synchronous dynamic random access memory (DDR SDRAM), which may be referred to as DDR memory or DDR for short. It should be understood that the implementation of the external storage device is not limited herein.
The data input interface 330 may be configured to read data according to the data storage format of the external storage device.
Optionally, the data of the input feature map is stored in the external storage device in a format in which each row of input feature data is stored aligned to X bits, where X is a multiple of 8. The data input interface 330 is configured to read the input feature data from the external storage device in bursts of X bits × L, where L is a positive integer. For example, X is 128.
As an example, in the external storage device each row of data is stored aligned to 128 bits. Assuming that the quantization bit width of the data is 8 bits, every 16 data values are stored aligned in one address of the external storage device, and the end of a row is padded with invalid data where necessary so that the starting address of the next row is 128-bit aligned. In this case, the data input interface 330 is configured to read the data in the external storage device in bursts of 128 bits × L, that is, to access the external storage device in burst mode with a granularity of 128 bits per access, where L is a positive integer. For example, L is a positive integer less than or equal to 8, that is, the data input interface 330 accesses the data of at most 8 addresses in each burst.
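As a non-limiting illustration, the 128-bit aligned row layout and the burst lengths can be sketched as follows, using the example values X = 128, an 8-bit quantization width, and L of at most 8. The function names are hypothetical, and issuing bursts separately for each row is an assumption; the text does not say whether a burst may span rows.

```python
ALIGN_BITS = 128    # X in the example above
DATA_BITS = 8       # quantization bit width in the example above
MAX_BURST_LEN = 8   # upper bound on L in the example above


def row_layout(width):
    """Return (number of 128-bit words per row, number of padded invalid values)."""
    values_per_word = ALIGN_BITS // DATA_BITS  # 16 values per address
    words = -(-width // values_per_word)       # ceiling division
    padding = words * values_per_word - width
    return words, padding


def bursts_for_row(width):
    """Split one row into burst lengths L, each at most MAX_BURST_LEN words."""
    words, _ = row_layout(width)
    bursts = []
    while words > 0:
        length = min(words, MAX_BURST_LEN)
        bursts.append(length)
        words -= length
    return bursts


print(row_layout(40), bursts_for_row(40))    # (3, 8) [3]
print(row_layout(200), bursts_for_row(200))  # (13, 8) [8, 5]
```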
It should be understood that the way in which the computing device provided in this application reads data from the external storage device can be adapted to the data storage format of the external storage device, which improves data reading efficiency.
In the case where the input feature map is three-dimensional, for example, the input cube shown in FIG. 1, the data input interface 330 reads the input feature maps sequentially, one map at a time.
Optionally, the computing device 300 further includes a cache unit (not shown in FIG. 3). The data input interface 330 is configured to: read the data of the input feature map in parallel from the external storage device in row-major order; cache the data of the input feature map read in parallel in the cache unit; perform parallel-to-serial conversion on the data of the input feature map in the cache unit; and broadcast the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units 320.
Here, reading the data of the input feature map in parallel from the external storage device in row-major order means reading the data of the input feature map in parallel from the external storage device at row granularity.
For example, the data input interface 330 may be configured to read the data of one input feature map in a Z-shaped (zigzag) scanning manner.
For example, the data input interface 330 may broadcast the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units 320 in raster order.
It should be understood that, by using the cache unit to cache the data of the input feature map, the data input interface 330 not only buffers the data but also resolves the mismatch between internal and external data processing speeds.
The cache unit may be located in the data input interface 330.
Optionally, the cache unit may be a first in, first out (FIFO) module. The FIFO module may be a FIFO memory or a FIFO queue.
The data bit width of the FIFO module may be designed according to the data storage format of the external storage device.
For example, if each row of data is stored aligned to 128 bits in the external storage device, the data bit width of the FIFO module is 128 bits.
It should be understood that, if the data storage format of the external storage device is such that each row of data is stored aligned to 128 bits and the end of a row is padded with invalid data where necessary so that the starting address of the next row is 128-bit aligned, the data input interface 330 can remove the padded invalid data while performing the parallel-to-serial conversion on the data stored in the cache unit.
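As a non-limiting illustration, the parallel-to-serial conversion with removal of padded invalid data can be sketched as follows. The function name parallel_to_serial is hypothetical, and a Python deque stands in for the FIFO module. The usage below uses 4 values per word instead of 16 only to keep the example short.

```python
from collections import deque


def parallel_to_serial(fifo, row_width, values_per_word=16):
    """Pop aligned words from the FIFO and yield only the valid values, in raster order."""
    words_per_row = -(-row_width // values_per_word)  # ceiling division
    word_in_row = 0
    while fifo:
        word = fifo.popleft()
        base = word_in_row * values_per_word
        for offset, value in enumerate(word):
            if base + offset < row_width:  # positions past the row width hold padding
                yield value
        word_in_row = (word_in_row + 1) % words_per_row


# Two rows of 6 valid values each, stored as 4-value words; zeros mark the padding.
fifo = deque([[1, 2, 3, 4], [5, 6, 0, 0], [7, 8, 9, 10], [11, 12, 0, 0]])
serial = list(parallel_to_serial(fifo, row_width=6, values_per_word=4))

print(serial)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```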
The cache unit may be provided separately from the data input interface 330 or integrated with it. For example, the cache unit may be located in the data input interface 330.
Optionally, the number of computing units 320 included in the computing device 300 may be related to the granularity at which the data input interface 330 reads data and to the number of pixels that a computing unit 320 processes in each clock cycle.
As an example, assume that a computing unit 320 processes 1 pixel in each clock cycle. If the granularity at which the data input interface 330 reads data is 128 bits (assuming that the quantization bit width of the data is 8 bits), the number of computing units 320 included in the computing device 300 may be set to 16, that is, S may be 16.
As another example, assume that a computing unit 320 processes 1 pixel in each clock cycle. If the granularity at which the data input interface 330 reads data is 128 bits × 8 (assuming that the quantization bit width of the data is 8 bits), the number of computing units 320 included in the computing device 300 may be set to 128, that is, S may be 128.
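As a non-limiting illustration, the two examples above are consistent with dividing the read granularity by the number of bits one computing unit consumes per clock cycle. This relation is inferred from the two examples and is an assumption; the function name num_computing_units is hypothetical.

```python
def num_computing_units(read_granularity_bits, data_bits=8, pixels_per_cycle=1):
    """Assumed relation: values delivered per read / values one unit consumes per cycle."""
    return read_granularity_bits // (data_bits * pixels_per_cycle)


s_first_example = num_computing_units(128)       # granularity of 128 bits
s_second_example = num_computing_units(128 * 8)  # granularity of 128 bits x 8

print(s_first_example, s_second_example)  # 16 128
```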
It should be noted that, in this application, the number of computing units 320 included in the computing device 300 may be determined according to actual requirements. For example, in practical applications, factors such as the performance requirements of the application and resource occupation can be considered together to decide the number of computing units 320 included in the computing device 300. Resource occupation includes any one or more of the following: occupation of storage space (memory) and the volume of the device.
Optionally, the computing device 300 supports tailoring of its computing units 320. In other words, the number of computing units 320 in the computing device 300 can change dynamically.
For example, when there are few regions of interest to be processed, the computing device 300 can be set to have a smaller number of computing units 320; when there are many regions of interest to be processed, the computing device 300 can be set to have a larger number of computing units 320.
It should be understood that, because the computing device 300 supports tailoring of its computing units 320, it can be adapted flexibly to ROI-pooling layers with different computing requirements.
It should be understood that the more computing units 320 the computing device 300 includes, the more storage space the computing device 300 requires and the larger its overall volume; the fewer computing units 320 the computing device 300 includes, the less storage space it requires and the smaller its overall volume. Because the computing device 300 provided in this application supports tailoring of its computing units 320, the number of computing units 320 can be adjusted flexibly according to computing requirements, which effectively reduces resource occupation while maintaining computing performance.
It should also be understood that, depending on application requirements, the number S of computing units 320 included in the computing device 300 may also be 1.
Continuing to refer to FIG. 3, optionally, the computing device 300 further includes a data output interface 340 configured to output the output data calculated by the N computing units 320 to the external storage device.
Optionally, the data of the input feature map is stored in the external storage device in a format in which each row of input feature data is stored aligned to X bits, where X is a multiple of 8. The data output interface 340 is configured to splice the output data of the S computing units into X-bit units for aligned caching, and to output the aligned cached data to the external storage device.
Optionally, N is an integer greater than 1. As shown in FIG. 10, the computing device 300 may further include an arbitration unit 350 configured to transmit the output data calculated by the N computing units 320 to the data output interface 340 one after another in a preset order.
For example, the arbitration unit 350 may be configured to transmit the aligned data to the data output interface 340 using a fair polling (round-robin) algorithm.
It should be understood that using the arbitration unit 350 to transmit the output data of the multiple computing units 320 to the data output interface 340 in a preset order facilitates the management of the data in subsequent processing.
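As a non-limiting illustration, fair polling among the computing units can be sketched as a round-robin arbiter. The function name round_robin is hypothetical, and skipping a unit that has no pending output is an assumption about the arbitration policy; the application only states that a fair polling algorithm may be used.

```python
def round_robin(queues, start=0):
    """Yield (unit index, item) pairs, granting the units in turn and skipping empty ones."""
    pending = [list(queue) for queue in queues]
    pointer = start
    while any(pending):
        for _ in range(len(pending)):
            unit = pointer
            pointer = (pointer + 1) % len(pending)
            if pending[unit]:
                yield unit, pending[unit].pop(0)
                break


# Pending aligned output words of three computing units.
outputs = [["a0", "a1"], ["b0"], ["c0", "c1"]]
order = list(round_robin(outputs))

print(order)  # [(0, 'a0'), (1, 'b0'), (2, 'c0'), (0, 'a1'), (2, 'c1')]
```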
Optionally, the configuration interface 310 is configured to, after the computing device 300 completes the pooling processing of the N regions of interest, transmit configuration information indicating the positions of P regions of interest to P computing units 320 of the S computing units 320, where the P regions of interest correspond one-to-one to the P computing units 320, and P is a positive integer less than or equal to S.
When the number of regions of interest on one input feature map is greater than N, the P regions of interest are regions of interest on the current input feature map that have not yet been pooled.
When the input data of the ROI-pooling layer consists of multiple input feature maps (for example, the L maps shown in FIG. 1), the P regions of interest may be regions of interest on the next input feature map.
When the number of regions of interest on one input feature map is greater than N and the input data of the ROI-pooling layer consists of multiple input feature maps, the P regions of interest may be regions of interest on the current input feature map that have not yet been pooled, or may be regions of interest on the next input feature map.
In other words, after each configuration instruction has been executed, whether the next instruction switches first to the next input feature map or first to the P regions of interest on the current input feature map that have not yet been pooled can be configured dynamically through instructions. In practice, the computation rate and bandwidth requirements of the two switching orders can be analyzed according to actual needs, and the better order selected.
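The two switching orders can be compared with a small scheduling sketch. It is a simplified model under stated assumptions: `schedule`, its parameters, and the `(map_index, first_roi, batch_size)` tuple layout are hypothetical and do not reflect the actual configuration instruction format.

```python
def schedule(roi_counts, units, rois_first=True):
    """Return the configuration steps as (map_index, first_roi, batch_size).

    roi_counts[m] is the number of regions of interest on input feature
    map m; `units` is the number of computing units S. Both names are
    assumptions made for this sketch.
    """
    # Split the regions of interest of each map into batches of at most S.
    batches = [
        [(m, start, min(units, count - start)) for start in range(0, count, units)]
        for m, count in enumerate(roi_counts)
    ]
    if rois_first:
        # Finish every batch on the current map before switching maps.
        return [step for per_map in batches for step in per_map]
    # Otherwise run batch 0 on every map, then batch 1 on every map, and so on.
    depth = max((len(b) for b in batches), default=0)
    return [b[i] for i in range(depth) for b in batches if i < len(b)]
```

With 5 regions of interest on each of two maps and 4 computing units, the first order re-reads a map once per batch before moving on, while the second order switches maps after every batch; which of the two is cheaper depends on the bandwidth and rate analysis described above.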
Based on the above description, the computing device provided by this application includes multiple computing units and can therefore pool multiple regions of interest in parallel, which improves the processing efficiency of the ROI-pooling layer.
It should be understood that the resolution or size of every image or region mentioned herein is expressed in pixels. For example, in FIG. 2, the resolution of the input feature map is 8×8 pixels, the resolution of the region of interest is 7×5 pixels, and the size of the pooled output frame is 2×2 pixels.
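To make the sizes in this example concrete, the sketch below splits a region of interest into data windows, one per output pixel. The floor/ceil bin boundaries are an assumption borrowed from common ROI-pooling implementations; this passage does not state how FIG. 2 partitions the region, nor whether 7×5 means width×height, which is also assumed here. The function name `data_windows` is hypothetical.

```python
import math


def data_windows(roi_h, roi_w, out_h, out_w):
    """Return (row_start, row_end, col_start, col_end) for each output pixel.

    End indices are exclusive and coordinates are relative to the top-left
    corner of the region of interest. Boundaries use floor/ceil, which is
    an assumed convention, not one taken from the source.
    """
    windows = []
    for i in range(out_h):
        r0 = math.floor(i * roi_h / out_h)
        r1 = math.ceil((i + 1) * roi_h / out_h)
        for j in range(out_w):
            c0 = math.floor(j * roi_w / out_w)
            c1 = math.ceil((j + 1) * roi_w / out_w)
            windows.append((r0, r1, c0, c1))
    return windows


# A region 7 pixels wide and 5 pixels high, pooled to a 2x2 output.
windows = data_windows(5, 7, 2, 2)
```

Under this convention the upper windows cover rows 0 to 2 and the lower windows cover rows 2 to 4, so row 2 belongs to both, which is the kind of row overlap that the caching variants of step S440 exploit.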
The computing device 300 provided by this application may be an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The computing device 300 provided by this application can be used to implement hardware acceleration of the ROI-pooling layer in a convolutional neural network (CNN).
As an example, the computing device 300 can be applied to an intellectual property (IP) core and to circuits through which IP cores work together.
As another example, the computing device 300 can also be applied in other types of neural network accelerators or processors that include an ROI-pooling layer.
The device embodiments provided by this application are described above; the method embodiments are described below. It should be understood that the description of the method embodiments corresponds to the description of the device embodiments. Content that is not described in detail can therefore be found in the device embodiments above and, for brevity, is not repeated here.
FIG. 11 is a schematic flowchart of a calculation method for an ROI-pooling layer provided by an embodiment of this application. For example, the calculation method may be executed by the computing device 300 of the above embodiments. The calculation method includes the following steps S1110 and S1120.
S1110: Acquire configuration information indicating the positions of N regions of interest, where N is a positive integer.
S1120: According to the configuration information, perform parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest.
Step S1120 includes pooling processing of the N regions of interest, where the method of pooling a first region of interest includes steps S410 to S440 described above. For brevity, details are not repeated here. It should be understood that the first region of interest denotes each of the N regions of interest.
Optionally, in the embodiment shown in FIG. 11, step S1120, in combination with step S440 described above, may specifically include: obtaining a column processing result for each row of data falling within the data window area; and performing row processing on the column processing results to obtain the output data of the data window area.
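The two-stage reduction just described can be sketched as follows. The use of `max` as the pooling operation is an assumption, since this passage does not name the operation, and `pool_window` is a hypothetical helper rather than part of the described device.

```python
def pool_window(feature_map, r0, r1, c0, c1, op=max):
    """Pool one data window in two stages.

    Stage 1 (column processing): reduce each row of the window across its
    columns, giving one partial result per row.
    Stage 2 (row processing): reduce those per-row results to one output.
    End indices r1 and c1 are exclusive.
    """
    column_results = [op(row[c0:c1]) for row in feature_map[r0:r1]]
    return op(column_results)


feature_map = [
    [1, 2, 3],
    [4, 9, 6],
    [7, 8, 5],
]
top_left = pool_window(feature_map, 0, 2, 0, 2)
```

Splitting the reduction this way suits a pixel stream that arrives row by row: the per-row result can be finalized as soon as a row of the window has passed, and only one value per row needs to be kept for the second stage.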
Optionally, in the embodiment shown in FIG. 11, step S1120, in combination with step S440 described above, may specifically include: if the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching the first column processing result of the row overlap area while obtaining the output data of the first data window area; and, while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.
Optionally, in the embodiment shown in FIG. 11, step S1120, in combination with step S440 described above, may specifically include: if the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching the row processing result of the first column processing result of the row overlap area while obtaining the output data of the first data window area; and, while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.
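Both caching variants can be illustrated with one short sketch. It assumes max pooling, two vertically adjacent windows that share the same column range, and a non-empty row overlap; the function `pool_two_windows` and the flag `cache_row_result` are hypothetical names introduced for the example.

```python
def pool_two_windows(feature_map, win1, win2, c0, c1, op=max, cache_row_result=False):
    """Pool two windows that overlap in rows, reusing work on the overlap.

    win1 and win2 are (row_start, row_end) pairs with exclusive ends, and
    win2 must start inside win1. With cache_row_result=False the per-row
    column results of the overlap are cached (first variant); with True
    they are first reduced to a single value (second variant).
    """
    (a0, a1), (b0, b1) = win1, win2
    assert a0 <= b0 < a1 <= b1, "win2 must start inside win1"
    # Column processing for every row of the first window.
    col = {r: op(feature_map[r][c0:c1]) for r in range(a0, a1)}
    out1 = op(col.values())
    # Cache what the second window can reuse from the overlap rows.
    overlap = [col[r] for r in range(b0, a1)]
    cached = [op(overlap)] if cache_row_result else overlap
    # Column processing only for the rows that are new to the second window.
    fresh = [op(feature_map[r][c0:c1]) for r in range(a1, b1)]
    out2 = op(cached + fresh)
    return out1, out2
```

The second variant stores a single value regardless of how many rows overlap, at the cost of one extra reduction while the first window is being processed. For an operation such as max both variants give the same outputs.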
Optionally, in the embodiment shown in FIG. 11, step S1120 includes: performing parallel pooling processing on the N regions of interest through N computing units of a computing device that includes S computing units, where the N computing units correspond one-to-one to the N regions of interest, S is an integer greater than 1, and N is an integer less than S.
It can be understood that, besides using N computing units to pool the N regions of interest in parallel at the hardware level, the parallel processing can also be carried out at the software level to improve the processing efficiency of the regions of interest; this is not specifically limited here.
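A software-level counterpart might look like the sketch below, which hands one region of interest to each worker. Everything here is an assumption made for illustration: max pooling, floor/ceil window boundaries, regions given as `(top, left, height, width)`, and the helper names. Note also that CPython threads mainly illustrate the one-worker-per-region structure; real speed-ups would need processes or a native implementation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of interest (top, left, height, width)."""
    top, left, h, w = roi
    out = []
    for i in range(out_h):
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        row_out = []
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            row_out.append(max(max(r[c0:c1]) for r in feature_map[r0:r1]))
        out.append(row_out)
    return out


def pool_rois_in_parallel(feature_map, rois, out_h, out_w):
    """Pool every region of interest with its own worker."""
    with ThreadPoolExecutor(max_workers=max(1, len(rois))) as pool:
        return list(pool.map(lambda roi: roi_max_pool(feature_map, roi, out_h, out_w), rois))


feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
outputs = pool_rois_in_parallel(feature_map, [(0, 0, 4, 4), (1, 1, 2, 2)], 2, 2)
```

All workers read the same feature map, mirroring the way the hardware broadcasts one input stream to every computing unit.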
The computing device is, for example, the computing device 300 in the above embodiments. The N computing units are, for example, N of the computing units 320 in the above embodiments.
For example, each computing unit processes one pixel per clock cycle.
Optionally, the computing unit includes an operation module for performing arithmetic processing on the data that falls within the corresponding data window.
For example, the number of operation modules included in the computing unit is related to the width of the pooled output frame.
The operation module is, for example, the operation module 324 in the above embodiments.
Optionally, in the embodiment shown in FIG. 11, the method further includes performing the following steps through a data input interface: acquiring configuration information that indicates the starting position of the input feature map in an external storage device and the resolution of the input feature map; reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map; and broadcasting the read data of the input feature map to the N computing units.
The data input interface is, for example, the data input interface 330 in the above embodiments.
Optionally, in the embodiment shown in FIG. 11, reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map includes: reading the data of the input feature map from the external storage device in parallel, in row-major order. Broadcasting the read data of the input feature map to the N computing units includes: buffering the data of the input feature map read in parallel in a cache unit; performing parallel-to-serial conversion on the data of the input feature map in the cache unit; and broadcasting the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units.
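The read, buffer, serialize, and broadcast sequence can be modeled in a few lines. This is a behavioral sketch only: `broadcast_stream`, the `burst` parameter (the number of pixels fetched per parallel read), and the list-based buffers are assumptions made for the example.

```python
def broadcast_stream(feature_map, burst, units):
    """Model the data input path for `units` computing units.

    Pixels are fetched in row-major order, `burst` pixels per parallel
    read, buffered, converted to a serial pixel stream, and broadcast so
    that every unit receives every pixel in the same order.
    """
    # Row-major order: all of row 0, then all of row 1, and so on.
    flat = [pixel for row in feature_map for pixel in row]
    # Each buffer entry stands for one parallel read from external storage.
    buffer = [flat[i:i + burst] for i in range(0, len(flat), burst)]
    received = [[] for _ in range(units)]
    for word in buffer:
        for pixel in word:        # parallel-to-serial: one pixel at a time
            for rx in received:   # broadcast: every unit sees the pixel
                rx.append(pixel)
    return received


streams = broadcast_stream([[1, 2, 3], [4, 5, 6]], burst=4, units=2)
```

Because every unit receives the full stream, each one selects the pixels that fall within its own data windows, and the feature map is read from external storage only once per batch of regions of interest.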
Optionally, the cache unit is separate from the computing units; in addition, the cache unit may be provided separately from the data input interface or integrated with it. For example, the cache unit may be located in the data input interface.
For example, the number S of computing units included in the computing device is related to the granularity at which the data input interface reads data and to the number of pixels each computing unit processes per clock cycle.
Optionally, in the embodiment shown in FIG. 11, the calculation method further includes: outputting the output data of the N regions of interest to the external storage device through a data output interface.
For example, the calculation method further includes: transmitting, through an arbitration unit, the output data of the N regions of interest to the data output interface one after another in a preset order.
The data output interface is, for example, the data output interface 340 in the above embodiments, and the arbitration unit is, for example, the arbitration unit 350 in the above embodiments.
Optionally, in the embodiment shown in FIG. 11, after the pooling processing of the N regions of interest is completed, the calculation method further includes: acquiring configuration information indicating the positions of P regions of interest, where the P regions of interest are regions of interest on the current input feature map that have not yet been pooled, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
When the number of regions of interest on one input feature map is greater than N, the P regions of interest are regions of interest on the current input feature map that have not yet been pooled.
When the input data of the ROI-pooling layer consists of multiple input feature maps (for example, the L maps shown in FIG. 1), the P regions of interest may be regions of interest on the next input feature map.
When the number of regions of interest on one input feature map is greater than N and the input data of the ROI-pooling layer consists of multiple input feature maps, the P regions of interest may be regions of interest on the current input feature map that have not yet been pooled, or may be regions of interest on the next input feature map.
In other words, after each configuration instruction has been executed, whether the next instruction switches first to the next input feature map or first to the P regions of interest on the current input feature map that have not yet been pooled can be configured dynamically through instructions. In practice, the computation rate and bandwidth requirements of the two switching orders can be analyzed according to actual needs, and the better order selected.
As described above, in the embodiment shown in FIG. 11, step S1120 is implemented by N computing units. Optionally, step S1120 can also be implemented in software.
FIG. 12 is a schematic block diagram of a neural network system 1200 provided by an embodiment of this application. The neural network system 1200 includes a computing device 1210 for a region of interest-pooling layer, and the computing device 1210 is the computing device 300 of the above embodiments.
It should be understood that the neural network system 1200 may also include computing devices 1220 for other neural network layers.
For example, the computing devices 1220 include any one or more of the following: a computing device for a convolutional layer, a computing device for an activation layer, a computing device for a pooling layer, and a computing device for a fully connected layer.
The computing devices mentioned herein may also be referred to as hardware accelerators.
It can be understood that, for the beneficial effects of the ROI-pooling layer calculation method and of the neural network system provided herein, reference may be made to the description of the computing device for the region of interest-pooling layer in the above embodiments; details are not repeated here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)).
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
The above are only specific implementations of this application, but the scope of protection of this application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application. The scope of protection of this application shall therefore be determined by the scope of protection of the claims.

Claims (34)

  1. A computing device for a region of interest-pooling layer, wherein the computing device comprises a configuration interface and S computing units, S being an integer greater than 1;
    the configuration interface is configured to transmit configuration information indicating the positions of N regions of interest to N computing units of the S computing units, wherein the N regions of interest correspond one-to-one to the N computing units, and N is a positive integer less than or equal to S;
    each of the N computing units is configured to perform pooling processing on its corresponding region of interest to obtain output data of the corresponding region of interest.
  2. The computing device according to claim 1, wherein a first computing unit of the N computing units is configured to perform pooling processing on a first region of interest to obtain output data of the first region of interest;
    wherein performing pooling processing on the first region of interest to obtain the output data of the first region of interest comprises:
    acquiring data of an input feature map, the input feature map comprising K regions of interest, K being a positive integer not less than N;
    obtaining, according to the position of the first region of interest and the resolution of a pooled output frame, a data window area on the first region of interest that corresponds to data to be output for the first region of interest;
    selecting, from the acquired data of the input feature map, the data that falls within the data window area;
    performing arithmetic processing on the data that falls within the data window area to obtain output data of the data window area.
  3. The computing device according to claim 2, wherein performing arithmetic processing on the data that falls within the data window area to obtain the output data of the data window area comprises:
    obtaining a column processing result for each row of data that falls within the data window area;
    performing row processing on the column processing results to obtain the output data of the data window area.
  4. The computing device according to claim 2, wherein performing arithmetic processing on the data that falls within the data window area to obtain the output data of the data window area comprises:
    if the first region of interest comprises a first data window area and a second data window area that share a row overlap area, caching a first column processing result of the row overlap area while obtaining the output data of the first data window area;
    while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window area.
  5. The computing device according to claim 2, wherein performing arithmetic processing on the data that falls within the data window area to obtain the output data of the data window area comprises:
    if the first region of interest comprises a first data window area and a second data window area that share a row overlap area, caching a row processing result of a first column processing result of the row overlap area while obtaining the output data of the first data window area;
    while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window area.
  6. The computing device according to any one of claims 1 to 5, wherein the computing device further comprises a data input interface;
    wherein the configuration interface is further configured to transmit, to the data input interface, configuration information indicating the starting position of the input feature map in an external storage device and the resolution of the input feature map;
    the data input interface is configured to:
    read the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map;
    broadcast the read data of the input feature map to the N computing units.
  7. The computing device according to claim 6, wherein the computing device further comprises a cache unit;
    wherein the data input interface is configured to:
    read the data of the input feature map from the external storage device in parallel, in row-major order;
    buffer the data of the input feature map read in parallel in the cache unit;
    perform parallel-to-serial conversion on the data of the input feature map in the cache unit;
    broadcast the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units.
  8. The computing device according to claim 6 or 7, wherein the computing device further comprises:
    a data output interface configured to output the output data calculated by the N computing units to the external storage device.
  9. The computing device according to claim 8, wherein the computing device further comprises:
    an arbitration unit configured to transmit the output data calculated by the N computing units to the data output interface one after another in a preset order.
  10. The computing device according to any one of claims 6 to 9, wherein S is related to the granularity at which the data input interface reads data and to the number of pixels the computing unit processes per clock cycle.
  11. The computing device according to claim 10, wherein the computing unit processes one pixel per clock cycle.
  12. The computing device according to any one of claims 2 to 5, wherein the first computing unit comprises:
    an operation module configured to perform the arithmetic processing on the data that falls within the data window area to obtain the output data of the data window area.
  13. The computing device according to claim 12, wherein the number of operation modules is related to the width of the pooled output frame.
  14. The computing device according to any one of claims 2 to 5, wherein the first computing unit further comprises:
    a storage module configured to buffer the received data of the input feature map.
  15. The computing device according to any one of claims 1 to 14, wherein the configuration interface is configured to:
    after the computing device completes the pooling processing of the N regions of interest, transmit configuration information indicating the positions of P regions of interest to P computing units of the S computing units, wherein the P regions of interest correspond one-to-one to the P computing units, and P is a positive integer less than or equal to S;
    wherein the P regions of interest are regions of interest on the current input feature map that have not yet been pooled, or the P regions of interest are regions of interest on the next input feature map.
  16. The computing device according to any one of claims 1 to 15, wherein the computing device is an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
  17. A calculation method for a region of interest-pooling layer, comprising:
    acquiring configuration information indicating the positions of N regions of interest on an input feature map, N being a positive integer;
    performing, according to the configuration information, parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest.
  18. The calculation method according to claim 17, wherein performing parallel pooling processing on the N regions of interest to obtain output data of the corresponding regions of interest comprises:
    performing pooling processing on a first region of interest to obtain output data of the first region of interest;
    wherein performing pooling processing on the first region of interest to obtain the output data of the first region of interest comprises:
    acquiring data of the input feature map, the input feature map comprising K regions of interest, K being a positive integer not less than N;
    obtaining, according to the position of the first region of interest and the resolution of a pooled output frame, a data window area on the first region of interest that corresponds to data to be output for the first region of interest;
    selecting, from the acquired data of the input feature map, the data that falls within the data window area;
    performing arithmetic processing on the data that falls within the data window area to obtain output data of the data window area.
  19. The calculation method according to claim 18, wherein performing arithmetic processing on the data falling into the data window area to obtain the output data of the data window area comprises:
    obtaining a column processing result for each row of data falling into the data window area; and
    performing row processing on the column processing results to obtain the output data of the data window area.
  20. The calculation method according to claim 18, wherein performing arithmetic processing on the data falling into the data window area to obtain the output data of the data window area comprises:
    if the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching a first column processing result of the row overlap area while obtaining the output data of the first data window area; and
    while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window.
  21. The calculation method according to claim 18, wherein performing arithmetic processing on the data falling into the data window area to obtain the output data of the data window area comprises:
    if the first region of interest includes a first data window area and a second data window area that share a row overlap area, caching a row processing result of the first column processing result of the row overlap area while obtaining the output data of the first data window area; and
    while calculating the output data of the second data window area, performing column processing on the row data of the second data window area other than the row overlap area to obtain a second column processing result, and performing row processing on the second column processing result and the cached row processing result of the first column processing result of the row overlap area to obtain the output data of the second data window.
  22. The calculation method according to any one of claims 17 to 21, wherein performing parallel pooling processing on the N regions of interest comprises:
    performing parallel pooling processing on the N regions of interest by N computing units in a computing device that includes S computing units, wherein the N computing units are in one-to-one correspondence with the N regions of interest, S is an integer greater than 1, and N is less than or equal to S.
  23. The calculation method according to claim 22, wherein the calculation method further comprises:
    through a data input interface:
    acquiring configuration information indicating the starting position of the input feature map in an external storage device and indicating the resolution of the input feature map;
    reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map; and
    broadcasting the read data of the input feature map to the N computing units.
  24. The calculation method according to claim 23, wherein reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map comprises:
    reading the data of the input feature map from the external storage device in parallel, in row-major order, according to the starting position and the resolution of the input feature map;
    and wherein broadcasting the read data of the input feature map to the N computing units comprises:
    buffering the data of the input feature map read in parallel in a cache unit;
    performing parallel-to-serial conversion on the data of the input feature map in the cache unit; and
    broadcasting the data of the input feature map obtained by the parallel-to-serial conversion to the N computing units.
  25. The calculation method according to any one of claims 22 to 24, wherein S is related to the granularity at which the data input interface reads data and to the number of pixels the computing unit processes in each clock cycle.
  26. The calculation method according to claim 25, wherein the computing unit processes one pixel in each clock cycle.
  27. The calculation method according to any one of claims 22 to 26, wherein the computing unit comprises an arithmetic module configured to perform arithmetic processing on the data falling into the corresponding data window in the region of interest.
  28. The calculation method according to claim 27, wherein the number of arithmetic modules included in the computing unit is related to the width of the pooled output frame.
  29. The calculation method according to any one of claims 22 to 28, wherein the computing unit further comprises a cache module configured to cache the acquired data of the input feature map.
  30. The calculation method according to any one of claims 17 to 29, wherein the calculation method further comprises:
    outputting the output data of the N regions of interest to an external storage device through a data output interface.
  31. The calculation method according to claim 30, wherein the calculation method further comprises:
    transmitting, through an arbitration unit, the output data of the N regions of interest to the data output interface one after another in a preset order.
  32. The calculation method according to any one of claims 17 to 31, wherein, after the pooling processing of the N regions of interest is completed, the calculation method further comprises:
    acquiring configuration information indicating the positions of P regions of interest, wherein the P regions of interest are regions of interest on the current input feature map that have not yet been pooled, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
  33. The calculation method according to any one of claims 22 to 29, wherein the computing device is an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
  34. A neural network system, comprising:
    the computing device for a region of interest-pooling layer according to any one of claims 1 to 16.
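
The method of claims 17 to 20 can be illustrated with a small software sketch. It is not the claimed hardware, only a minimal model under stated assumptions: the "arithmetic processing" is taken to be max pooling (the claims do not fix the operation), a region of interest is encoded as inclusive corners (x0, y0, x1, y1), the data windows use the common floor/ceil bin boundaries, and threads stand in for the N computing units. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def data_windows(roi, pooled_h, pooled_w):
    """Yield (ph, pw, r0, r1, c0, c1): the half-open data window of each output cell.

    roi is (x0, y0, x1, y1) with inclusive corners (an assumed encoding).
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0 + 1, x1 - x0 + 1
    for ph in range(pooled_h):
        r0 = y0 + (ph * roi_h) // pooled_h
        r1 = y0 + ceil_div((ph + 1) * roi_h, pooled_h)
        for pw in range(pooled_w):
            c0 = x0 + (pw * roi_w) // pooled_w
            c1 = x0 + ceil_div((pw + 1) * roi_w, pooled_w)
            yield ph, pw, r0, r1, c0, c1


def roi_max_pool(fmap, roi, pooled_h, pooled_w):
    """Pool one ROI: column processing per row, then row processing (max is assumed)."""
    # Column processing results keyed by (row, c0, c1). Rows shared by two
    # vertically adjacent windows are computed once and then reused.
    cache = {}

    def column_result(r, c0, c1):
        key = (r, c0, c1)
        if key not in cache:
            cache[key] = max(fmap[r][c0:c1])
        return cache[key]

    out = [[None] * pooled_w for _ in range(pooled_h)]
    for ph, pw, r0, r1, c0, c1 in data_windows(roi, pooled_h, pooled_w):
        # Row processing: combine the per-row column results of this window.
        out[ph][pw] = max(column_result(r, c0, c1) for r in range(r0, r1))
    return out


def pool_rois(fmap, rois, pooled_h, pooled_w):
    """Pool N ROIs in parallel, one worker per ROI; every worker sees the same map."""
    with ThreadPoolExecutor(max_workers=len(rois)) as executor:
        return list(executor.map(
            lambda roi: roi_max_pool(fmap, roi, pooled_h, pooled_w), rois))


if __name__ == "__main__":
    fmap = [[r * 10 + c for c in range(6)] for r in range(5)]
    print(pool_rois(fmap, [(0, 0, 4, 4), (1, 1, 4, 3)], 2, 2))
```

With a 5-row region of interest and a 2-row output frame, the two row windows cover rows 0 to 2 and rows 2 to 4, so row 2 is the row overlap area whose column processing result is served from the cache for the second window.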
PCT/CN2019/118933 2019-11-15 2019-11-15 Roi-pooling layer computation method and device, and neural network system WO2021092941A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980039309.2A CN112313673A (en) 2019-11-15 2019-11-15 Region-of-interest-pooling layer calculation method and device, and neural network system
PCT/CN2019/118933 WO2021092941A1 (en) 2019-11-15 2019-11-15 Roi-pooling layer computation method and device, and neural network system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/118933 WO2021092941A1 (en) 2019-11-15 2019-11-15 Roi-pooling layer computation method and device, and neural network system

Publications (1)

Publication Number Publication Date
WO2021092941A1 (en) 2021-05-20

Family

ID=74336509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118933 WO2021092941A1 (en) 2019-11-15 2019-11-15 Roi-pooling layer computation method and device, and neural network system

Country Status (2)

Country Link
CN (1) CN112313673A (en)
WO (1) WO2021092941A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871308A (en) * 2016-09-27 2018-04-03 韩华泰科株式会社 Method and apparatus for handling wide angle picture
US20180232629A1 (en) * 2017-02-10 2018-08-16 Kneron, Inc. Pooling operation device and method for convolutional neural network
CN110210490A (en) * 2018-02-28 2019-09-06 深圳市腾讯计算机系统有限公司 Image processing method, device, computer equipment and storage medium
CN110383330A (en) * 2018-05-30 2019-10-25 深圳市大疆创新科技有限公司 Pooling device and pooling method
CN110399977A (en) * 2018-04-25 2019-11-01 华为技术有限公司 Pooling arithmetic unit

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229645B (en) * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and calculation processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871308A (en) * 2016-09-27 2018-04-03 韩华泰科株式会社 Method and apparatus for handling wide angle picture
US20180232629A1 (en) * 2017-02-10 2018-08-16 Kneron, Inc. Pooling operation device and method for convolutional neural network
CN110210490A (en) * 2018-02-28 2019-09-06 深圳市腾讯计算机系统有限公司 Image processing method, device, computer equipment and storage medium
CN110399977A (en) * 2018-04-25 2019-11-01 华为技术有限公司 Pooling arithmetic unit
CN110383330A (en) * 2018-05-30 2019-10-25 深圳市大疆创新科技有限公司 Pooling device and pooling method

Also Published As

Publication number Publication date
CN112313673A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
US20200285446A1 (en) Arithmetic device for neural network, chip, equipment and related method
US20210073569A1 (en) Pooling device and pooling method
WO2018196863A1 (en) Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
US11734554B2 (en) Pooling processing method and system applied to convolutional neural network
JP6335335B2 (en) Adaptive partition mechanism with arbitrary tile shapes for tile-based rendering GPU architecture
WO2019084788A1 (en) Computation apparatus, circuit and relevant method for neural network
JP2002328881A (en) Image processor, image processing method and portable video equipment
WO2020034079A1 (en) Systolic array-based neural network processing device
WO2019216376A1 (en) Arithmetic processing device
US10070134B2 (en) Analytics assisted encoding
WO2019041264A1 (en) Image processing apparatus and method, and related circuit
CN111626405A (en) CNN acceleration method, CNN acceleration device and computer readable storage medium
WO2021258512A1 (en) Data aggregation processing apparatus and method, and storage medium
US20200379928A1 (en) Image processing accelerator
US20220113944A1 (en) Arithmetic processing device
CN109213745B (en) Distributed file storage method, device, processor and storage medium
US20200327638A1 (en) Connected component detection method, circuit, device and computer-readable storage medium
WO2021092941A1 (en) Roi-pooling layer computation method and device, and neural network system
CN116912556A (en) Picture classification method and device, electronic equipment and storage medium
CN111767243A (en) Data processing method, related device and computer readable medium
WO2021237513A1 (en) Data compression storage system and method, processor, and computer storage medium
US6771271B2 (en) Apparatus and method of processing image data
US20230214327A1 (en) Data processing device and related product
WO2021136433A1 (en) Electronic device and computer system
WO2022165718A1 (en) Interface controller, data transmission method, and system on chip

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19952593; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 19952593; Country of ref document: EP; Kind code of ref document: A1)