CN112313673A - Region-of-interest-pooling layer calculation method and device, and neural network system - Google Patents
- Publication number: CN112313673A
- Application number: CN201980039309.2A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G — Physics
- G06 — Computing; calculating or counting
- G06N — Computing arrangements based on specific computational models
- G06N3/00 — Computing arrangements based on biological models
- G06N3/02 — Neural networks
Abstract
A method and a device for calculating a region-of-interest pooling layer, and a neural network system, are provided. The computing device comprises a configuration interface and S computing units. The configuration interface is configured to transmit configuration information indicating the positions of N regions of interest to N of the S computing units, where the N regions of interest are in one-to-one correspondence with the N computing units. Each of the N computing units is configured to pool its corresponding region of interest and obtain the output data of that region of interest. Because the computing device comprises a plurality of computing units, it can pool a plurality of regions of interest in parallel, and the processing efficiency of the ROI-pooling layer can therefore be improved without significantly increasing power consumption.
Description
Copyright declaration
The disclosure of this patent document contains material which is subject to copyright protection. The copyright is owned by the copyright owner. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Technical Field
The present application relates to the field of data processing, and more particularly, to a method and apparatus for calculating a region-of-interest pooling layer, and a neural network system.
Background
At present, research on artificial intelligence (AI) is advancing rapidly; in particular, the accuracy of convolutional neural networks (CNNs) in image classification and detection far exceeds that of traditional machine-vision algorithms. A CNN is composed of several predefined elementary layers, including convolutional layers, activation layers, pooling layers, fully-connected layers, etc., where the pooling layers may include a region-of-interest (ROI) pooling layer (ROI-pooling layer).
In the prior art, the data processing of the region-of-interest pooling layer is implemented on a central processing unit (CPU) or graphics processing unit (GPU) computing platform. The region-of-interest pooling layer is computationally expensive: a CPU computing platform has low computational throughput and cannot meet its performance requirements, while a GPU computing platform consumes too much power. Thus, a traditional CPU or GPU computing scheme cannot balance computational performance and power consumption.
Therefore, there is a need for a region-of-interest-pooling layer processing scheme with lower power consumption.
Disclosure of Invention
The application provides a method and a device for calculating a region-of-interest pooling layer, and a neural network system, which can effectively improve the computational efficiency of the region-of-interest pooling layer without significantly increasing power consumption.
A first aspect provides a region of interest-pooling layer computing apparatus. The computing device comprises a configuration interface and S computing units, wherein S is an integer larger than 1. The configuration interface is configured to transmit configuration information indicating positions of N regions of interest to N of the S calculation units, where the N regions of interest are in one-to-one correspondence with the N calculation units, and N is a positive integer less than or equal to S. Each of the N calculation units is configured to pool the region of interest corresponding thereto, and obtain output data corresponding to the region of interest.
A second aspect provides a method of calculating a region-of-interest-pooling layer. The calculation method comprises the following steps: acquiring configuration information indicating positions of N interested areas, wherein N is a positive integer; and performing parallel pooling on the N interested areas according to the configuration information to obtain output data of the corresponding interested areas.
A third aspect provides a neural network system that includes the region-of-interest-pooling layer computing device of the first aspect.
Because the computing device comprises a plurality of computing units, it can pool a plurality of regions of interest in parallel, so that the processing efficiency of the region-of-interest pooling layer can be improved.
Drawings
Fig. 1 is a functional schematic of a region-of-interest-pooling layer.
Fig. 2 is a schematic illustration of region-of-interest-pooling.
FIG. 3 is a schematic block diagram of a computing device according to an embodiment of the present application.
Fig. 4 is a schematic flow chart of acquiring output data of a data window region in the embodiment of the present application.
Fig. 5 is another schematic of region-of-interest-pooling.
Fig. 6 is yet another schematic of region-of-interest-pooling.
Fig. 7 is yet another schematic of region-of-interest-pooling.
Fig. 8 is a schematic block diagram of a computing unit of an embodiment of the present application.
Fig. 9 is another schematic block diagram of a computing unit of an embodiment of the present application.
FIG. 10 is another schematic block diagram of a computing device according to an embodiment of the present application.
Fig. 11 is a schematic flow chart of a region of interest-pooling layer calculation method according to an embodiment of the present application.
Fig. 12 is a schematic block diagram of a neural network system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
For a better understanding of the embodiments of the present application, the related concepts of the region-of-interest pooling layer (hereinafter referred to as the ROI-pooling layer) are described below.
As shown in fig. 1, the function of the ROI-pooling layer is to down-sample regions of interest (ROI) in a feature map.
The input feature map (IFM) of the ROI-pooling layer is the output of the previous layer. The input data of the ROI-pooling layer may be a single feature map or a 3D array composed of a plurality of feature maps. As shown in fig. 1, the input data of the ROI-pooling layer is L feature maps, and the resolution of each feature map is H (height) × W (width).
The output data (output feature map, OFM) of the ROI-pooling layer is composed of several cubes; as shown on the right side of fig. 1, there are M cubes, and the number of cubes output by the ROI-pooling layer is determined by the number of regions of interest (ROI) in the input feature map.
The dimensions of each cube are the same, e.g., in the example of FIG. 1, each cube is made up of L output feature maps. Where the resolution of each output feature map is the same, for example, in the example of fig. 1, the resolution of each output feature map in the cube is E (height) × F (width).
The function of the ROI-pooling layer is to down-sample the region of interest in the input feature map. For example, in the example of fig. 1, taking one of the L feature maps as an example, the ROI-pooling layer down-samples the input feature map with resolution H × W to an output feature map with resolution E × F.
The resolution of the feature map output by the ROI-pooling layer may be predefined; for example, in the example of fig. 1, the resolution E × F of the output cube may be predefined.
The mapping method of the ROI-pooling layer's pooling process can also be predefined, and is generally the maximum (max) or the average (avg).
It can be understood that the ROI-pooling layer is characterized in that the size of the region of interest to be pooled may not be fixed, while the size of the output feature map corresponding to each region of interest is fixed.
By way of example and not limitation, in fig. 1 the calculation process of the ROI-pooling layer is: for each output data point, derive the corresponding data window region on the input feature map from the resolution E × F of the output cube and the position of the region of interest in the input feature map; then perform arithmetic processing on the data in the data window region to obtain the output data corresponding to that data window region. The arithmetic processing here may be a maximum-value or an average-value calculation.
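The calculation process described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `roi_pool`, the floor/ceil bin boundaries, and the `(top, left, height, width)` ROI encoding are all assumptions.

```python
import math

def roi_pool(feature_map, roi, out_h, out_w, mode="max"):
    """Down-sample one region of interest to a fixed out_h x out_w grid.

    feature_map: 2-D list of numbers (one channel of the input feature map).
    roi: (top, left, height, width) of the region of interest.
    mode: "max" or "avg" -- the predefined mapping of the pooling process.
    """
    top, left, h, w = roi
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        # Derive the rows of the data window region for output row i.
        r0 = top + math.floor(i * h / out_h)
        r1 = top + math.ceil((i + 1) * h / out_h)
        for j in range(out_w):
            c0 = left + math.floor(j * w / out_w)
            c1 = left + math.ceil((j + 1) * w / out_w)
            # Collect the data falling into the data window region.
            window = [feature_map[r][c]
                      for r in range(r0, r1) for c in range(c0, c1)]
            # Arithmetic processing: maximum-value or average-value calculation.
            out[i][j] = max(window) if mode == "max" else sum(window) / len(window)
    return out
```

For a 4 × 4 region of interest pooled to 2 × 2, each output datum is computed from a non-overlapping 2 × 2 data window region; for sizes that do not divide evenly, adjacent windows overlap, as discussed later.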
For ease of understanding and description, but not limitation, concepts and terms related to the present application are described below.
1. Region of interest
The region of interest represents the region on the input feature map that is to be pooled (i.e., downsampled).
2. Resolution of the pooled output box
The resolution of the pooled output box is the resolution of the feature map obtained after the region of interest has been pooled; for example, it is the resolution E × F of the output cube shown in fig. 1.
The pooled output box may also be referred to as the feature map obtained after the region of interest has been pooled.
Herein, the pixel points in the feature map obtained after pooling a region of interest are referred to as output data. It should be understood that, assuming the resolution of the pooled output box is E × F, one region of interest corresponds to (E × F) output data.
3. Data window area
A data window region is the sub-region of a region of interest over which one piece of output data is computed.
Taking one output datum corresponding to a certain region of interest as an example: the output datum is obtained by pooling the data in a certain sub-region of the region of interest. That sub-region is referred to herein as a data window region.
The above concepts are described below in connection with the example of fig. 2.
In fig. 2, the resolution of the pooled output box is 2 × 2, which means that the resolution of the feature map obtained after pooling any region of interest in the input feature map is 2 × 2; in other words, any region of interest in the input feature map corresponds to 4 output data.
In fig. 2, a region of interest spanning rows 4 to 8 and columns 1 to 7 of the input feature map is shown. Assume that the 4 output data corresponding to this region of interest are C1, C2, C3 and C4, where the data window region corresponding to C1 on the region of interest is A1, that corresponding to C2 is A2, that corresponding to C3 is A3, and that corresponding to C4 is A4. That is, C1 is obtained by performing arithmetic processing (maximum or average) on the data in data window region A1, and C2, C3 and C4 are obtained analogously.
It should be understood that fig. 2 is only an example and not a limitation, and in practical applications, a plurality of regions of interest may be included on one input feature map.
In the conventional technology, a central processing unit (CPU) or a graphics processing unit (GPU) is used to perform the computation of the ROI-pooling layer. The computational efficiency of a CPU computing platform cannot meet the performance requirements of the ROI-pooling layer, and the power consumption of a GPU computing platform is large.
The application provides a method and a device for calculating the ROI-pooling layer, and a neural network system, which can effectively improve the computational efficiency of the ROI-pooling layer without incurring large power consumption.
Fig. 3 is a schematic block diagram of a computing device 300 for the ROI-pooling layer provided herein.
As shown in fig. 3, the computing apparatus 300 includes a plurality of computing units 320, each of which can pool one region of interest. It can be understood that, by means of the plurality of computing units 320, a plurality of regions of interest can be pooled in parallel. That is, the computing device provided by the present application can process the data of the ROI-pooling layer in parallel, so that the processing efficiency of the ROI-pooling layer can be improved.
As shown in fig. 3, the computing apparatus 300 further comprises a configuration interface 310 configured to transmit configuration information indicating the positions of N regions of interest to the N computing units 320, wherein the N regions of interest are in one-to-one correspondence with the N computing units 320.
The configuration information may indicate the location of the N regions of interest on the input feature map.
For example, the configuration information may indicate coordinates of pixel points in the N regions of interest.
Optionally, the position of the region of interest includes the coordinates of the first pixel in the upper left corner of the region of interest, and the size of the region of interest.
Optionally, the location of the region of interest includes coordinates of all pixel points in the region of interest.
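As one hypothetical concrete encoding of such configuration information (the class and field names below are illustrative, not from the patent), the position of a region of interest could be carried as its upper-left pixel coordinates plus its size:

```python
from dataclasses import dataclass

@dataclass
class RoiConfig:
    """Hypothetical per-region-of-interest configuration record."""
    top: int     # row of the first pixel in the upper-left corner
    left: int    # column of the first pixel in the upper-left corner
    height: int  # size of the region of interest (rows)
    width: int   # size of the region of interest (columns)
```

For the region of interest of fig. 2 (rows 4 to 8, columns 1 to 7), this would be `RoiConfig(top=4, left=1, height=5, width=7)`.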
Each of the N calculation units 320 is configured to pool the region of interest corresponding thereto, and obtain output data of the corresponding region of interest.
For example, each computing unit 320 of the N computing units 320 is configured to determine a region of interest corresponding thereto from the received configuration information; and performing pooling processing on the data in the region of interest corresponding to the data to obtain output data of the region of interest.
Alternatively, the N computing units 320 represent all computing units of the computing device 300, i.e., all computing units 320 of the computing device 300 participate in the pooling process.
Alternatively, the N computing units 320 are a subset of the computing units of the computing device 300, i.e., only some of the computing units 320 of the computing device 300 participate in the pooling process.
For example, denote the total number of computing units included in the computing apparatus 300 as S, where S is an integer greater than 1; then N is a positive integer less than or equal to S.
It should be understood that in practical applications, it may be determined whether some or all of the computing units in the computing apparatus 300 participate in the operation according to application requirements.
Optionally, N is an integer greater than 1.
It should be understood that, in this embodiment, more than one computing unit 320 can pool more than one region of interest, that is, the data of the ROI-pooling layer can be processed in parallel, so that the processing efficiency of the ROI-pooling layer can be improved.
It is to be understood that N may also be equal to 1. For example, if only 1 region of interest needs to be pooled in the current calculation task, the value of N is set to 1.
It should also be understood that, whatever the value of N, the computing device provided by the present application supports parallel pooling of multiple regions of interest and can therefore improve the processing efficiency of the ROI-pooling layer.
It should be understood that the function of each computing unit 320 of the N computing units 320 is the same, i.e. the method by which each computing unit 320 pools the region of interest is similar. For ease of understanding and description, the function and operation of the computing unit 320 are described herein by taking the first computing unit 320 of the N computing units 320 as an example. It should be understood that the description herein of the first computing unit 320 may be adaptively applied to each computing unit 320 of the N computing units 320.
The first calculation unit 320 is configured to pool the first region of interest, obtaining output data of the first region of interest. The first region of interest represents a region of interest of the N regions of interest corresponding to the first calculation unit 320. The first computing unit 320 performs pooling processing on the first region of interest, and the method of obtaining the output data of the first region of interest includes the following steps S410 to S440, as shown in fig. 4.
S410: acquire the data of an input feature map, where the input feature map comprises K regions of interest, and K is a positive integer not less than N.
With continued reference to fig. 3, the computing apparatus 300 further includes a data input interface 330 configured to read the data of the input feature map from an external storage device. The first computing unit 320 may acquire the data of the input feature map from the data input interface 330. For example, the data input interface 330 is configured to send the data of the input feature map to the first computing unit 320.
S420: obtain, on the first region of interest, the data window region corresponding to each output datum of the first region of interest, according to the position of the first region of interest and the resolution of the pooled output box.
S430: select, from the acquired data of the input feature map, the data falling into the data window region.
S440: perform arithmetic processing on the data falling into the data window region to obtain the output data of the data window region. The arithmetic processing may be a maximum-value or average-value calculation; the specific processing mode may be predefined.
It is to be understood that when the first computing unit 320 has obtained the output data of all data window regions of the first region of interest, the output data of the first region of interest has also been obtained. Assuming the resolution of the pooled output box is E × F, the first region of interest corresponds to (E × F) output data.
The computing device provided by the application can process the data of the ROI-pooling layer in parallel and, compared with the prior art, can effectively improve the processing efficiency of the ROI-pooling layer at lower power consumption.
It should be understood that the method of dividing a region of interest into a plurality of data window regions based on the resolution of the pooled output box is known in the prior art and will not be described in detail herein.
Step S440 may be implemented in various ways.
Optionally, in step S440, performing arithmetic processing on the data falling into the data window region to obtain the output data of the data window region includes: acquiring the column processing result of each row of data falling into the data window region; and performing row processing on the column processing results to obtain the output data of the data window region.
It should be understood that if the operation of the pooling process (also referred to as the mapping relationship) is the maximum, then the column processing and the row processing both take the maximum. If the operation is the average, then the column processing computes an accumulated sum, and the row processing first accumulates the column results and then divides to obtain the average.
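The column-then-row decomposition can be sketched as follows for a single data window. This is a hedged illustration; the name `pool_window` and the list-of-rows representation are assumptions, not the patent's interface.

```python
def pool_window(window_rows, mode="max"):
    """Pool one data window: reduce each row first (column processing),
    then reduce across the per-row results (row processing)."""
    if mode == "max":
        col_results = [max(row) for row in window_rows]  # column processing
        return max(col_results)                          # row processing
    # "avg": column processing accumulates; row processing
    # accumulates the column results and then divides.
    col_results = [sum(row) for row in window_rows]
    total = sum(col_results)
    count = sum(len(row) for row in window_rows)
    return total / count
```

The intermediate `col_results` list is exactly what the later overlap-caching embodiments store and reuse.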
An example of pooling of the region of interest by the first computing unit 320 is described below in connection with fig. 5. In fig. 5, the operation of the pooling process (also referred to as the mapping relationship) is the maximum, the resolution of the pooled output box is 2 × 2, and the resolution of the region of interest is 8 × 4.
In the example of fig. 5, the pooling of the region of interest by the first calculation unit 320 comprises the following steps 1-1 and 1-2.
Step 1-1: based on the resolution 2 × 2 of the pooled output box, determine 4 data window regions in the region of interest, such as data window regions 1, 2, 3 and 4 shown in fig. 5.
It should be understood that step 1-1 corresponds to step S420 and step S430 in fig. 4.
Step 1-2: perform maximum-value processing on the data in each of the 4 data window regions to obtain the corresponding output data.
As shown in fig. 5, maximum-value processing of the data in data window region 1 yields output data 29; of the data in data window region 2, output data 31; of the data in data window region 3, output data 30; and of the data in data window region 4, output data 28.
Taking data window region 1 as an example, in step 1-2 the maximum-value processing of the data in data window region 1 to obtain its output data 29 may include the following sub-steps.
Substep 1-2-1, the column processing result of each row of data falling into the data window region 1 is obtained.
For example, the maximum value is obtained for the first row data {12,3} to obtain the column processing result 12 of the first row data; the second row data {29,26} is subjected to maximum value calculation to obtain a column processing result 29 of the second row data; the maximum value of the third row data {2,11} is obtained, and a column processing result 11 of the third row data is obtained; the fourth row data {12,13} is maximized to obtain a column processing result 13 of the fourth row data.
Sub-step 1-2-2: perform row processing on the column processing results of the rows of data in data window region 1, obtaining the output data 29 of data window region 1.
For example, the maximum of the column processing results {12,29,11,13} of the 4 rows of data in data window region 1 is taken, giving the row processing result 29, i.e., the output data 29 of data window region 1.
It should be understood that step 1-2 corresponds to step S440 in fig. 4.
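The sub-steps above can be reproduced directly on the fig. 5 numbers (plain Python used here only to check the arithmetic):

```python
# Data window region 1 of fig. 5: 4 rows x 2 columns.
window_1 = [[12, 3], [29, 26], [2, 11], [12, 13]]

# Sub-step 1-2-1: column processing -- take the maximum of each row of data.
col_results = [max(row) for row in window_1]   # [12, 29, 11, 13]

# Sub-step 1-2-2: row processing -- take the maximum of the column results.
output = max(col_results)                      # 29, the output data of region 1
```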
It should also be understood that in the example shown in fig. 5, a region of interest with a resolution of 8 × 4 is pooled into a feature map with a resolution of 2 × 2.
In the example of fig. 5, there is no overlap region between different data window regions. In some cases, there may be overlapping regions between different data window regions, as shown in fig. 6.
In fig. 6, the operation of the pooling process (also referred to as the mapping relationship) is the maximum, the resolution of the pooled output box is 2 × 2, and the resolution of the region of interest is 6 × 4. As can be seen from fig. 6, output data 29 and 22 correspond to data window regions 1 and 3, respectively, which have an overlap region between them; likewise, output data 31 and 23 correspond to data window regions 2 and 4, respectively, which also have an overlap region between them.
It should be understood that fig. 6 is merely exemplary and not limiting, and in practical applications, the overlap region between two data window regions may include one or more rows of data, or may include one or more columns of data.
If the overlap region between two data window regions comprises one or more lines of data, the overlap region may also be referred to as a line overlap region. If the overlap region between two data window regions comprises one or more columns of data, the overlap region may also be referred to as a column overlap region.
Optionally, in step S440, performing arithmetic processing on the data falling into the data window region to obtain the output data of the data window region includes: if the first region of interest comprises a first data window region and a second data window region that have a row overlap region, caching the first column processing results of the row overlap region while acquiring the output data of the first data window region; and, while calculating the output data of the second data window region, performing column processing on the data of the second data window region outside the row overlap region to obtain second column processing results, and performing row processing on the second column processing results together with the cached first column processing results to obtain the output data of the second data window region.
For example, the first region of interest is the region of interest shown in fig. 6, the first data window region is data window region 1 shown in fig. 6, and the second data window region is data window region 3 shown in fig. 6. Data window regions 1 and 3 have a row overlap region, which includes two rows of data: {{2,11}, {12,13}}.
In the example of fig. 6, the process of the first calculation unit 320 obtaining the output data 29 falling within the data window area 1 may include the following steps 2-1 to 2-3. The process of the first calculation unit 320 obtaining the output data 22 falling in the data window area 3 may comprise the following steps 3-1 and 3-2.
And 2-1, performing column processing on each row of data falling into the data window area 1 to obtain a column processing result of each row of data. In this example, the operation method of the pooling process is to obtain the maximum value, and accordingly, the operation methods of the column process and the row process are both to obtain the maximum value.
Referring to fig. 6, the first row data {12,3} is maximized to obtain a column processing result 12 of the first row data; the second row data {29,26} is subjected to maximum value calculation to obtain a column processing result 29 of the second row data; the maximum value of the third row data {2,11} is obtained, and a column processing result 11 of the third row data is obtained; the fourth row data {12,13} is maximized to obtain a column processing result 13 of the fourth row data.
And 2-2, performing row processing on the column processing result obtained in the step 2-1 to obtain a row processing result, namely obtaining the output data of the data window area 1.
Referring to fig. 6, the column processing results {12,29,11,13} obtained in step 2-1 are maximized to obtain row processing results 29, i.e., output data 29 of the data window region 1.
Step 2-3, cache the column processing results {11,13} for the row overlap region (i.e., rows 3 and 4).
Alternatively, step 2-3 may be included in step 2-1 or step 2-2.
Step 3-1: perform column processing on the data falling into data window region 3, excluding the row overlap region.
Referring to fig. 6, the third row data {7,14} is maximized to obtain the column processing result 14 of the third row data; the fourth row data {22,4} is maximized to obtain a column processing result 22 of the fourth row data.
And 3-2, performing row processing on the column processing result {14,22} obtained in the step 3-1 and the column processing result {11,13} of the row overlapping area cached in the step 2-3 to obtain a row processing result, namely obtaining the output data of the data window area 3.
That is, the maximum value is obtained between the column processing result {14,22} obtained in step 3-1 and the column processing result {11,13} of the line overlap region buffered in step 2-3, and the output data 22 of the data window region 3 is obtained.
In the example described above in conjunction with fig. 6, when acquiring the output data of data window region 3, the first calculation unit 320 omits the read operation for the first row data {2,11} and the second row data {12,13} of data window region 3, and directly reuses the column processing results of those two rows buffered while acquiring the output data of data window region 1. As seen from step 2-1 above, the read operation for the two rows of data {2,11} and {12,13} in the row overlap region was already performed while acquiring the output data of data window region 1. Therefore, this embodiment avoids repeated data reading when different data window regions have overlapping regions.
It should be understood that the present embodiment can avoid repeated reading of data, thereby saving bandwidth and improving computational efficiency.
It should be understood that step 2-1 through step 2-3, and step 3-1 through step 3-2 described above may be alternative embodiments of step S440.
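The column-then-row procedure of steps 2-1 to 3-2 can be sketched as follows. This is a minimal Python sketch using the fig. 6 values quoted above; the function and variable names are illustrative and not taken from the patent:

```python
def column_process(rows):
    # Column processing: reduce each row of the window to one value (maximum here).
    return [max(row) for row in rows]

def row_process(col_results):
    # Row processing: reduce the column results to the window's output value.
    return max(col_results)

# Data window region 1 (its four rows, as in fig. 6).
window1_rows = [[12, 3], [29, 26], [2, 11], [12, 13]]
col1 = column_process(window1_rows)        # [12, 29, 11, 13]
out1 = row_process(col1)                   # 29, output of data window region 1
cached_overlap = col1[2:]                  # step 2-3: cache rows 3-4 results [11, 13]

# Data window region 3: only the rows outside the overlap region are read.
window3_new_rows = [[7, 14], [22, 4]]
col3 = column_process(window3_new_rows)    # [14, 22]
out3 = row_process(col3 + cached_overlap)  # 22, reusing the cached column results
```

Note that `window3_new_rows` is the only data read for region 3; the cached results stand in for the re-read of rows 3 and 4.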
Optionally, in step S440, performing the operation on the data falling into the data window region to obtain the output data of the data window region includes: if the first region of interest includes a first data window region and a second data window region having a row overlap region, caching, while acquiring the output data of the first data window region, the row processing result of the first column processing results of the row overlap region; and, while computing the output data of the second data window region, performing column processing on the data of the second data window region outside the row overlap region to obtain second column processing results, and performing row processing on the second column processing results together with the cached row processing result of the row overlap region to obtain the output data of the second data window region.
As one example, see fig. 6: the first region of interest is the region of interest shown in fig. 6, the first data window region is data window region 1 shown in fig. 6, and the second data window region is data window region 3 shown in fig. 6. Data window region 1 and data window region 3 have a row overlap region, which contains two rows of data {{2,11}, {12,13}}.
The process of the first calculation unit 320 obtaining the output data 29 falling in the data window area 1 may comprise steps 4-1 to 4-3. The process of the first calculation unit 320 obtaining the output data 22 falling in the data window area 3 may comprise step 5-1 and step 5-2.
Wherein step 4-1 is the same as step 2-1 described above, and step 4-2 is the same as step 2-2 described above.
Step 4-3: cache the row processing result {13} of the column processing results {11,13} of the row overlap region (i.e., rows 3 and 4).
Alternatively, step 4-3 may be included in step 4-2.
Step 5-1 is the same as step 3-1 described above.
Step 5-2: perform row processing on the column processing results {14,22} obtained in step 5-1 together with the row processing result {13} of the row overlap region cached in step 4-3, obtaining the row processing result, i.e., the output data of data window region 3.
That is, the maximum value is taken over the column processing results {14,22} obtained in step 5-1 and the row processing result {13} of the row overlap region cached in step 4-3, obtaining the output data 22 of data window region 3.
In the present embodiment, when acquiring the output data of data window region 3, the first calculation unit 320 omits the read operation for the first row data {2,11} and the second row data {12,13} of data window region 3, and directly reuses the row processing result of those two rows buffered while acquiring the output data of data window region 1.
As seen from step 4-1 above, the read operation for the two rows of data {2,11} and {12,13} in the row overlap region was already performed while acquiring the output data of data window region 1. Therefore, this embodiment avoids repeated data reading when different data window regions have overlapping regions.
In addition, in step 4-3 of this embodiment, only the row processing result {13} of the row overlap region (i.e., rows 3 and 4) needs to be cached, rather than the column processing results {11,13}, which reduces the caching requirement of the first calculation unit 320.
It should be understood that the above-mentioned steps 4-1 to 4-3, and steps 5-1 to 5-2 may be alternative embodiments of step S440.
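Under the same fig. 6 assumptions, the reduced-cache variant of steps 4-1 to 5-2 can be sketched as follows; only the single row-processed value of the overlap region is buffered (names are illustrative):

```python
def pool_window1_and_cache():
    # Steps 4-1 and 4-2: column-then-row max pooling over data window region 1.
    rows = [[12, 3], [29, 26], [2, 11], [12, 13]]
    col_results = [max(r) for r in rows]   # [12, 29, 11, 13]
    output = max(col_results)              # 29
    # Step 4-3: cache only the row processing of the overlap rows' column results.
    cached = max(col_results[2:])          # max(11, 13) = 13, a single value
    return output, cached

def pool_window3(cached):
    # Step 5-1: column-process only the rows outside the overlap region.
    new_rows = [[7, 14], [22, 4]]
    col_results = [max(r) for r in new_rows]   # [14, 22]
    # Step 5-2: row-process together with the cached overlap value.
    return max(col_results + [cached])         # 22

out1, cached = pool_window1_and_cache()
out3 = pool_window3(cached)
```

Compared with the previous sketch, the buffer holds one value instead of one value per overlap row, matching the reduced caching requirement noted above.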
As another example, see fig. 7. As shown in fig. 7, the left side is a column of data in the region of interest, and the right side is a column of output data in the corresponding pooled output box (which may also be referred to as an output feature map). Pixel a is calculated from the input data {1,2,3,4,5}, and pixel b is calculated from the input data {3,4,5,6,7}. When the input data is processed row by row, the result over {3,4,5} is obtained while computing pixel a. This result is an intermediate result for both pixel a and pixel b. In this case, the result over {3,4,5} is cached during the computation of pixel a; when computing pixel b, the cached result can be read directly, and pixel b is computed from the cached result over {3,4,5} and the data {6,7}.
It should be understood that, by avoiding repeated data reading, the present embodiment can save bandwidth, improve computational efficiency, and reduce cache requirements.
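The fig. 7 reuse pattern can be sketched as follows. The figure does not fix the pooling operation, so maximum is assumed here; the slice indices are illustrative:

```python
# One column of input data from the region of interest, as on the left of fig. 7.
column = [1, 2, 3, 4, 5, 6, 7]

# The span {3,4,5} is shared by pixel a (inputs {1..5}) and pixel b (inputs {3..7}),
# so its partial result is computed once and cached.
partial = max(column[2:5])                  # max(3, 4, 5), the shared intermediate

pixel_a = max(max(column[0:2]), partial)    # combine {1,2} with the cached partial
pixel_b = max(partial, max(column[5:7]))    # combine the cached partial with {6,7}
```

Pixel b never re-reads {3,4,5}; it only reads {6,7} and the cache.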
As described above, in step S440 the data falling into the data window region is processed by performing column processing first and then row processing to obtain the output data of the data window region. The present application is not limited thereto. For example, in step S440 the data falling into the data window region may instead be processed by performing row processing first and then column processing to obtain the output data of the data window region.
Optionally, step S440 includes: acquiring a row processing result of each column of data falling into the data window area; and performing column processing on the acquired row processing result to acquire output data of the data window area.
It should be understood that in the embodiment that processes the data falling into the data window region by performing row processing first and then column processing, repeated data reading can likewise be avoided when different data window regions have overlapping regions; the implementation is similar to that described in the above embodiments and is not repeated here.
Optionally, as shown in fig. 8, the first computing unit 320 may include one or more operation modules 324, where the operation modules 324 are configured to perform operation processing on the data falling into the data window region, and obtain the output data of the data window region.
For example, step S440 in the above embodiment is performed by the operation module 324 in the first calculation unit 320.
Optionally, the first computing unit 320 includes a plurality of operation modules 324 therein, wherein each operation module 324 is configured to obtain one output data.
Alternatively, the number of operation modules 324 included in the first calculation unit 320 may be related to the width of the pooled output blocks.
For example, the resolution of the pooled output blocks is E (high) × F (wide), and the number of operation modules 324 in the first calculation unit 320 may be F.
It should be understood that the number of the operation modules in the first calculation unit 320 may also be determined according to actual requirements, and the application is not limited thereto. For example, in practical applications, the number of the operation modules in the first calculation unit 320 may be determined by comprehensively considering the performance requirements of the applications, the occupation of resources, and the like. Wherein the occupation of the resource comprises any one or more of the following items: occupation of memory space (memory), volume of the device.
Optionally, the operation modules of the first computing unit 320 are tailorable. In other words, the number of operation modules in the first computing unit 320 may be dynamically changed.
Therefore, since the operation modules of the first computing unit 320 provided by the present application are tailorable, the number of operation modules in the first computing unit 320 can be flexibly adjusted based on the computing requirements, which can improve the computing efficiency and the utilization rate of computing resources.
The first computing unit 320 may further include a calculation control module configured to execute steps S420 and S430 in the above embodiments.
Alternatively, the calculation control module and the operation module 324 may be arranged separately or arranged together. For example, the operation module 324 may include the calculation control module.
Optionally, as shown in fig. 8, the first computing unit 320 may further include a sub configuration interface 321, a sub data interface 322, and a storage module 323.
The sub-configuration interface 321 is configured to receive configuration information from the configuration interface 310.
For example, the configuration interface 310 and the sub-configuration interface 321 may each be an Advanced Peripheral Bus (APB) interface.
The sub-data interface 322 is configured to receive the data of the input feature map from the data input interface 330.
The storage module 323 is configured to cache the intermediate processing results.
It is to be understood that the storage module 323 may be configured to buffer intermediate processing data that the first calculation unit 320 needs to reuse in acquiring the output data of the first region of interest.
As an example, in the above-described embodiment with overlapping regions between different data window regions, the storage module 323 may be configured to buffer column processing results or row processing results of data on the row overlapping regions that data window region 1 and data window region 3 have.
Alternatively, the storage module 323 may be located in the operation module 324.
Optionally, the storage module 323 may be further configured to store the data of the input feature map received by the sub-data interface 322.
For example, the data of the input feature map stored in the storage module 323 can support the pooling of the plurality of regions of interest by the first computing unit 320. That is, the first computing unit 320 may directly obtain the data of the input feature map from the storage module 323 without obtaining the data from the outside during the pooling process of the one or more regions of interest.
Fig. 9 is another schematic block diagram of the first calculation unit 320. The first calculation unit 320 includes a sub configuration interface 321, a sub data interface 322, a storage module 323, and an operation module 324. The operation module 324 includes a control circuit portion shown in a left half of a block labeled 324 in fig. 9 and an operation circuit portion shown in a right half of the block.
The sub-configuration interface 321 is configured to receive configuration information sent by the configuration interface 310, the configuration information indicating a location of the region of interest.
The sub-data interface 322 is configured to receive the data of the input feature map sent by the data input interface 330.
The storage module 323 is configured to buffer data received by the sub-data interface 322 and may also be used to buffer intermediate processing results.
The control circuit portion of the operation module 324 is configured to execute the steps S420 and S430 described in the above embodiments.
For example, the control circuit portion of the operation module 324 is configured to calculate the start coordinate (w_start_floor) and the end coordinate (w_end_ceil) of the data window region currently to be computed, based on the position of the region of interest and the resolution of the pooled output box.
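The names w_start_floor and w_end_ceil suggest the usual floor/ceil partition of a region of interest into data windows; a hedged sketch of that computation follows. The exact formula is an assumption for illustration, not quoted from the patent:

```python
import math

def window_bounds(roi_x, W, F, j):
    # Assumed window partition: output column j of F columns over a region of
    # interest of width W starting at column roi_x spans the half-open interval
    # [w_start_floor, w_end_ceil).
    w_start_floor = roi_x + math.floor(j * W / F)
    w_end_ceil = roi_x + math.ceil((j + 1) * W / F)
    return w_start_floor, w_end_ceil
```

With W = 7 and F = 2 (a 7-wide region pooled to width 2, as in the fig. 2 example), the two windows are [0, 4) and [3, 7): adjacent windows overlap at column 3, which is exactly the overlap situation discussed above.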
The arithmetic circuit portion of the operation module 324 is configured to execute step S440 described in the above embodiments.
As shown in fig. 9, the arithmetic circuit portion includes a plurality of arithmetic circuits.
In a scenario where the operation of the pooling process is taking the maximum value, the arithmetic circuit is configured with a comparison function. Taking one arithmetic circuit as an example, the circuit has one or more input terminals and an output terminal; it compares the data input at the input terminals to obtain the maximum value and outputs it at the output terminal. For example, the arithmetic circuit may be implemented by a comparison circuit or comparator.
In a scenario where the operation of the pooling process is averaging, the arithmetic circuit is configured with an accumulate-and-average function. Taking one arithmetic circuit as an example, the circuit has one or more input terminals and an output terminal; it may accumulate the data input at the input terminals to obtain an accumulated sum, and further average the accumulated sum to obtain the average value, which is output at the output terminal. For example, the arithmetic circuit may be implemented by an adder and a multiplier.
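The behavior of the two arithmetic-circuit configurations can be modeled as follows. This is a behavioral Python sketch; the function names are illustrative and do not describe the actual circuit structure:

```python
def max_circuit(inputs):
    # Comparison configuration: compare the values at the input terminals and
    # emit the maximum at the output terminal.
    result = inputs[0]
    for x in inputs[1:]:
        if x > result:
            result = x
    return result

def avg_circuit(inputs):
    # Accumulate-and-average configuration: sum the input-terminal values
    # (the adder), then divide by the count (the multiplier, in hardware
    # typically multiplying by a precomputed reciprocal).
    acc = 0
    for x in inputs:
        acc += x
    return acc / len(inputs)
```

Either function can serve as the reduction used in the column and row processing of step S440, depending on the configured pooling operation.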
With continued reference to fig. 3, the computing device 300 may include a data input interface 330. The configuration interface 310 is further configured to transmit configuration information indicating a start position of the input feature map in the external storage device and indicating a resolution of the input feature map to the data input interface 330. The data input interface 330 is configured to: reading data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map; the data of the read input feature map is broadcast to the N calculation units 320.
It should be understood that because the data input interface 330 broadcasts the read data of the input feature map to the N computing units 320, a single read operation by the data input interface 330 from the external storage device supplies all N computing units 320 with the read data; therefore, repeated data reading can be avoided and bandwidth saved.
The computing device 300 may be a system on a chip. It should be appreciated that the memory resources of a system-on-chip are typically small, generally requiring the retrieval of data to be processed from an external memory device. In this application, the computing apparatus 300 may obtain input data of the ROI-Pooling layer, i.e., data of the input feature map, from an external storage device.
The external storage device may be a Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM). DDR SDRAM may be referred to as DDR memory or DDR for short. It should be understood that the implementation of the external storage device is not limited herein.
The data input interface 330 may be configured to read data in a data storage format of an external storage device.
Optionally, the data of the input feature map is stored in the external storage device in a format in which each row of input feature data is stored X-bit aligned, X being a multiple of 8. The data input interface 330 is configured to read the input feature data from the external storage device in bursts of X bits × L, L being a positive integer. For example, X takes the value 128.
As an example, in the external storage device, each row of data is stored 128-bit aligned. Assuming the quantized bit width of the data is 8 bits, every 16 data values are stored at one address of the external storage device, and when the end of a row does not contain enough valid data to fill a 128-bit word, it is padded to ensure that the start address of the next row is 128-bit aligned. In this case, the data input interface 330 is configured to read the data in the external storage device in bursts of 128 bits × L, that is, to access the external storage device in burst mode with a granularity of 128 bits per access, where L is a positive integer. For example, L is a positive integer less than or equal to 8, i.e., the data input interface 330 accesses at most 8 addresses of data per burst.
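The alignment and burst arithmetic of this example can be sketched as follows, assuming 8-bit data and 128-bit words as stated above; the helper names are illustrative:

```python
ALIGN_BYTES = 128 // 8  # one 128-bit word is 16 bytes

def aligned_row_bytes(row_pixels, bits_per_pixel=8):
    # Bytes a row occupies in the external storage device: the raw row size
    # rounded up to the next 128-bit boundary (the tail is padding).
    raw = row_pixels * bits_per_pixel // 8
    return (raw + ALIGN_BYTES - 1) // ALIGN_BYTES * ALIGN_BYTES

def bursts_per_row(row_pixels, L=8, bits_per_pixel=8):
    # Number of burst transactions to read one row, with each burst carrying
    # up to L words of 128 bits (L <= 8 in the example above).
    words = aligned_row_bytes(row_pixels, bits_per_pixel) // ALIGN_BYTES
    return (words + L - 1) // L
```

For instance, a 50-pixel row of 8-bit data occupies 64 bytes (four 128-bit words, 14 bytes of which are padding) and is read in a single burst.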
It should be understood that the manner in which the computing apparatus provided by the present application reads data from the external storage device may be adapted to the data storage format of the external storage device, so that the data reading efficiency may be improved.
In the case where the input feature map is three-dimensional, for example an input cube as shown in fig. 1, the data input interface 330 reads the input feature maps one by one in order.
Optionally, the computing device 300 further comprises a caching unit (not shown in fig. 3). The data input interface 330 is configured to: read the data of the input feature map from the external storage device in parallel in row-major order; cache the data of the input feature map read in parallel into the caching unit; perform parallel-to-serial conversion on the data of the input feature map in the caching unit; and broadcast the data of the input feature map obtained by the parallel-to-serial conversion to the N calculation units 320.
Reading the data of the input feature map from the external storage device in parallel in row-major order means reading the data of the input feature map from the external storage device in parallel at row granularity.
For example, the data input interface 330 may be configured to read data in a zigzag scanning manner for an input feature map.
For example, the data input interface 330 may broadcast the data of the input feature map obtained by the parallel-to-serial conversion process to the N calculation units 320 in raster order.
It should be understood that by employing a caching unit to cache the data of the input feature map, the data input interface 330 can, while buffering the data, resolve the mismatch between the internal and external data processing speeds.
The buffer unit may be located in the data input interface 330.
Alternatively, the buffer unit may be a First In First Out (FIFO) module. The FIFO module may be a FIFO memory or a FIFO queue.
The data bit width of the FIFO module may be designed according to the data storage format of the external storage device.
For example, in the external storage device, each row of data is stored in a 128-bit (bit) aligned manner, and the data bit width in the FIFO module is 128 bits.
It should be understood that if the data storage format of the external storage device stores each row of data 128-bit aligned, padding the tail of a row to ensure that the start address of the next row is 128-bit aligned, the data input interface 330 may remove the padded invalid data while performing the parallel-to-serial conversion on the data stored in the caching unit.
The buffer unit may be provided separately from the data input interface 330 or may be provided integrally therewith. For example, the buffer unit may be located in the data input interface 330.
Alternatively, the computing device 300 may include a number of computing units 320 that is related to the granularity at which the data input interface 330 reads data and the number of pixels processed by the computing units 320 per clock cycle.
As an example, assuming that the computing unit 320 processes 1 pixel point in each clock cycle, if the granularity of the data read by the data input interface 330 is 128 bits (assuming that the quantization bit width of the data is 8 bits), the computing apparatus 300 may include the computing unit 320 in a number of 16, that is, S may take a value of 16.
As another example, assuming that the computing unit 320 processes 1 pixel point in each clock cycle, if the granularity of the data read by the data input interface 330 is 128 bits × 8 (assuming that the quantization bit width of the data is 8 bits), the computing apparatus 300 may include the computing unit 320 in a number of 128, that is, S may take a value of 128.
It should be noted that, in the present application, the number of the computing units 320 included in the computing apparatus 300 may be determined according to actual requirements. For example, in practical applications, the number of computing units 320 included in the computing apparatus 300 may be determined by comprehensively considering the performance requirements of the applications, the occupation of resources, and the like. Wherein the occupation of the resource comprises any one or more of the following items: occupation of memory space (memory), volume of the device.
Optionally, the computing units 320 of the computing device 300 are tailorable. In other words, the number of computing units 320 in the computing device 300 may be dynamically changed.
For example, in the case of fewer regions of interest to be processed, the computing apparatus 300 may be provided with a smaller number of computing units 320; in case of a larger number of regions of interest to be processed, the computing device 300 may be arranged with a larger number of computing units 320.
It should be understood that because the computing units 320 of the computing device 300 are tailorable, the computing device 300 can be flexibly adapted to ROI-Pooling layers having different computing requirements.
It should be understood that the more computing units 320 the computing device 300 includes, the more storage space it requires and the larger its overall volume; the fewer computing units 320 it includes, the less storage space it requires and the smaller its overall volume. Therefore, because the computing units 320 of the computing device 300 provided by the present application are tailorable, the number of computing units 320 can be flexibly adjusted based on the computing requirements, effectively saving resources while ensuring computing performance.
It should also be understood that, depending on application requirements, the number S of computing units 320 included in the computing device 300 may also be 1.
With continued reference to fig. 3, optionally, the computing apparatus 300 further includes a data output interface 340 configured to output the output data calculated by the N calculating units 320 to an external storage device.
Optionally, the data of the input feature map is stored in the external storage device in a format in which each row of input feature data is stored X-bit aligned, X being a multiple of 8. The data output interface 340 is configured to splice the output data of the S computing units into X-bit aligned form for buffering, and to output the aligned, buffered data to the external storage device.
Optionally, N is an integer greater than 1, and as shown in fig. 10, the computing apparatus 300 may further include an arbitration unit 350 configured to sequentially transmit the output data computed by the N computing units 320 to the data output interface 340 according to a preset order.
For example, the arbitration unit 350 may be configured to transmit the aligned data to the data output interface 340 using a fair polling algorithm.
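A fair polling algorithm is commonly realized as round-robin arbitration; the following is a behavioral sketch under that assumption (the class and method names are illustrative, and the actual arbitration unit 350 may differ):

```python
from collections import deque

class RoundRobinArbiter:
    # Grants the output interface to the computing units in rotating order,
    # so that no unit with pending output data is starved.
    def __init__(self, n_units):
        self.order = deque(range(n_units))

    def grant(self, ready):
        # Grant the first ready unit in round-robin order, rotating the
        # polling sequence so the next grant starts after the granted unit.
        for _ in range(len(self.order)):
            unit = self.order[0]
            self.order.rotate(-1)
            if unit in ready:
                return unit
        return None  # no unit has output data ready this cycle
```

With all N units ready every cycle, the grants cycle 0, 1, 2, ..., N-1, 0, ..., transmitting the units' output data to the data output interface 340 in a fixed fair order.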
It should be understood that, for the output data of the plurality of computing units 320, the arbitration unit 350 is adopted to transmit the output data to the data output interface 340 according to the preset sequence, which is beneficial to the management of the data in the subsequent process.
Optionally, the configuration interface 310 is configured to transmit configuration information indicating positions of P regions of interest to P computing units 320 of the S computing units 320 after the computing apparatus 300 completes the pooling process of the N regions of interest, the P regions of interest corresponding to the P computing units 320 one by one, and P is a positive integer less than or equal to S.
In the case where the number of regions of interest on one input feature map is greater than N, the P regions of interest are the regions of interest on the current input feature map that have not yet been pooled.
In the case where the input data of the ROI-Pooling layer is a plurality of input feature maps (the L maps shown in fig. 1), the P regions of interest may be regions of interest on the next input feature map.
In the case where the number of regions of interest on one input feature map is greater than N and the input data of the ROI-Pooling layer is a plurality of input feature maps, the P regions of interest may be regions of interest on the current input feature map that have not yet been pooled, or may be regions of interest on the next input feature map.
In other words, after each configuration instruction is executed, whether the next instruction preferentially switches to the next input feature map or preferentially switches to the P regions of interest not yet pooled on the current input feature map can be dynamically configured by instruction. In practical applications, the computation rates and bandwidth requirements of the two switching orders can be analyzed according to actual requirements, so that the better switching order is selected.
Based on the above description, the computing apparatus provided by the present application, by including a plurality of computing units, can support parallel pooling processing of a plurality of regions of interest and thus improve the processing efficiency of the ROI-Pooling layer.
It should be understood that the units of resolution or size of an image or region referred to herein are pixels. For example, in FIG. 2, the resolution of the input feature map is 8 × 8 (unit: pixel), the resolution of the region of interest is 7 × 5 (unit: pixel), and the size of the output pooling box is 2 × 2 (unit: pixel).
The computing device 300 provided herein can be an Application Specific Integrated Circuit (ASIC) or a field-programmable gate array (FPGA).
The computing apparatus 300 provided in the present application may be applied to implement a hardware acceleration function of the ROI-Pooling layer in a Convolutional Neural Network (CNN).
As one example, the computing apparatus 300 may be applied to Intellectual Property (IP) cores and to cooperative circuitry between IP cores.
As another example, the computing device 300 may also be employed in other types of neural network accelerators or processors that include a ROI-Pooling layer.
The device embodiments provided herein are described above, and the method embodiments provided herein are described below. It should be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore, for brevity, details are not repeated here, and details that are not described in detail may be referred to the above apparatus embodiments.
Fig. 11 is a schematic flowchart of a calculation method for a ROI-Pooling layer according to an embodiment of the present application. For example, the calculation method may be performed by the computing apparatus 300 of the above embodiments. The calculation method includes the following steps S1110 and S1120.
S1110, obtaining configuration information indicating positions of N regions of interest, where N is a positive integer.
S1120: perform parallel pooling processing on the N regions of interest according to the configuration information to obtain output data of the corresponding regions of interest.
In step S1120, pooling of the N regions of interest is included, wherein the method of pooling the first region of interest comprises steps S410 to S440 as described above. For brevity, no further description is provided herein. It is to be understood that the first region of interest represents each of the N regions of interest.
Optionally, in the embodiment shown in fig. 11, step S1120, in combination with step S440 described above, may specifically include: acquiring a column processing result of each row of data falling into the data window area; and performing row processing on the column processing result to obtain output data of the data window area.
Optionally, in the embodiment shown in fig. 11, step S1120, in combination with step S440 described above, may specifically include: if the first region of interest includes a first data window region and a second data window region having a row overlap region, caching the first column processing results of the row overlap region while acquiring the output data of the first data window region; and, while computing the output data of the second data window region, performing column processing on the data of the second data window region outside the row overlap region to obtain second column processing results, and performing row processing on the second column processing results and the cached first column processing results to obtain the output data of the second data window region.
Optionally, in the embodiment shown in fig. 11, step S1120, in combination with step S440 described above, may specifically include: if the first region of interest includes a first data window region and a second data window region having a row overlap region, caching the row processing result of the first column processing results of the row overlap region while acquiring the output data of the first data window region; and, while computing the output data of the second data window region, performing column processing on the data of the second data window region outside the row overlap region to obtain second column processing results, and performing row processing on the second column processing results together with the cached row processing result of the row overlap region to obtain the output data of the second data window region.
Optionally, in the embodiment shown in fig. 11, step S1120 includes: performing parallel pooling processing on the N regions of interest through N computing units in a computing apparatus comprising S computing units, the N computing units corresponding to the N regions of interest one to one, where S is an integer greater than 1 and N is an integer smaller than S.
It can be understood that, in addition to performing parallel pooling processing on the N regions of interest by using the N computing units on a hardware level, the parallel pooling processing may also be performed on a software level to improve the processing efficiency of the regions of interest, which is not specifically limited herein.
The computing device is, for example, the computing device 300 in the above embodiments, and the N computing units are, for example, the N computing units 320 in the above embodiment.
For example, the computing unit processes one pixel per clock cycle.
Optionally, the computing unit includes an operation module configured to perform operation processing on the data falling into the corresponding data window.
For example, the number of operation modules included in the computing unit is related to the width of the pooling output frame.
The operation module is, for example, the operation module 324 in the above embodiment.
Optionally, in the embodiment shown in fig. 11, the method further includes performing the following steps through the data input interface: acquiring configuration information indicating a starting position of the input feature map in an external storage device and a resolution of the input feature map; reading the data of the input feature map from the external storage device according to the starting position and the resolution; and broadcasting the read data of the input feature map to the N computing units.
The data input interface is, for example, the data input interface 330 in the above embodiments.
Optionally, in the embodiment shown in fig. 11, reading data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map includes: reading the data of the input feature map from the external storage device in parallel in row-major order. Broadcasting the read data of the input feature map to the N computing units includes: caching the data of the input feature map read in parallel into a cache unit; performing parallel-serial conversion processing on the data of the input feature map in the cache unit; and broadcasting the data of the input feature map obtained by the parallel-serial conversion processing to the N computing units.
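The read-buffer-broadcast path can be sketched as follows (a software model with an assumed read granularity; a real data input interface would implement this in hardware):

```python
import numpy as np

READ_WIDTH = 4  # pixels fetched per parallel read (assumed read granularity)

def read_broadcast(fmap, n_units):
    """Read the feature map in row-major order in parallel chunks, buffer each
    chunk, serialize it pixel by pixel, and broadcast every pixel to all units."""
    unit_streams = [[] for _ in range(n_units)]
    flat = fmap.reshape(-1)                      # row-major traversal
    for start in range(0, flat.size, READ_WIDTH):
        buffer = flat[start:start + READ_WIDTH]  # parallel read into the cache unit
        for pixel in buffer:                     # parallel-to-serial conversion
            for stream in unit_streams:          # broadcast to the N computing units
                stream.append(float(pixel))
    return unit_streams

streams = read_broadcast(np.arange(8, dtype=np.float32).reshape(2, 4), n_units=3)
# Every computing unit sees the same serialized pixel stream.
assert all(s == streams[0] for s in streams)
```

Each computing unit then filters this common stream, keeping only the pixels that fall into its own region of interest.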
Optionally, the cache unit is separate from the computing units; it may be disposed separately from the data input interface or integrated with it. For example, the cache unit may be located in the data input interface.
For example, the number S of computing units included in the computing device is related to the granularity of data read by the data input interface and the number of pixels processed by each computing unit per clock cycle.
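For instance, if the data input interface delivers G pixels per clock cycle and each computing unit consumes one pixel per cycle, S = G units keep the pipeline balanced. The following formula is an illustration consistent with the description, not stated explicitly in the source:

```python
def num_compute_units(read_granularity_pixels, pixels_per_unit_per_cycle):
    # Match the total consumption rate of the S units to the read rate
    # of the data input interface.
    return read_granularity_pixels // pixels_per_unit_per_cycle

# e.g. 16 pixels read per cycle, 1 pixel processed per unit per cycle -> S = 16
assert num_compute_units(16, 1) == 16
```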
Optionally, in the embodiment shown in fig. 11, the calculation method further includes: outputting the output data of the N regions of interest to an external storage device through a data output interface.
For example, the calculation method further includes: sequentially transmitting the output data of the N regions of interest to the data output interface in a preset order through an arbitration unit.
The data output interface is, for example, the data output interface 340 in the above embodiment, and the arbitration unit is, for example, the arbitration unit 350 in the above embodiment.
Optionally, in the embodiment shown in fig. 11, after completing the pooling of the N regions of interest, the calculation method further includes: acquiring configuration information indicating the positions of P regions of interest, where the P regions of interest are regions of interest on the current input feature map that have not yet been pooled, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
In the case that the number of regions of interest on one input feature map is greater than N, the P regions of interest are the regions of interest on the current input feature map that have not yet been pooled.
In the case that the input data of the ROI-pooling layer is a plurality of input feature maps (L shown in fig. 1), the P regions of interest may be regions of interest on the next input feature map.
In the case that the number of regions of interest on one input feature map is greater than N and the input data of the ROI-pooling layer is multiple input feature maps, the P regions of interest may be regions of interest on the current input feature map that have not yet been pooled, or may be regions of interest on the next input feature map.
In other words, after each configuration instruction is executed, whether the next instruction preferentially switches to the next input feature map or to the P regions of interest not yet pooled on the current input feature map can be dynamically configured through the instruction. In practical applications, the computation rates and bandwidth requirements of the two switching orders can be analyzed against actual needs so that the better switching order is selected.
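The dynamically configurable switching order might be sketched as a simple scheduler decision (the policy flag and queue names here are hypothetical, chosen only to illustrate the two orders described above):

```python
def next_batch(remaining_rois, next_map_rois, prefer_next_map):
    """Pick the P regions for the next configuration instruction: either the
    un-pooled ROIs of the current feature map or the ROIs of the next map."""
    if prefer_next_map and next_map_rois:
        return ("next_map", next_map_rois)
    if remaining_rois:
        return ("current_map", remaining_rois)
    return ("next_map", next_map_rois)

# Policy configured per instruction: switch feature maps first, or drain
# the current map's remaining regions of interest first.
assert next_batch([1, 2], [3, 4], prefer_next_map=True)[0] == "next_map"
assert next_batch([1, 2], [3, 4], prefer_next_map=False)[0] == "current_map"
```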
It is described above that in the embodiment shown in fig. 11, step S1120 is implemented by N calculation units. Alternatively, the step S1120 may be implemented by software.
Fig. 12 is a schematic block diagram of a neural network system 1200 provided in an embodiment of the present application, where the neural network system 1200 includes a computing device 1210 of a region-of-interest-pooling layer, and the computing device 1210 is the computing device 300 in the above embodiment.
It should be understood that the neural network system 1200 may also include computing devices 1220 of other neural network layers.
For example, computing device 1220 includes any one or more of the following: a convolutional layer computing device, an active layer computing device, a pooling layer computing device, and a fully connected layer computing device.
The computing devices referred to herein may also be referred to as hardware accelerators.
It can be understood that, for the beneficial effects of the ROI-pooling layer calculation method and the neural network system provided herein, reference may be made to the description of the region-of-interest-pooling layer computing apparatus in the above embodiments; details are not repeated here.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., by infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (34)
1. A region-of-interest-pooling layer computing apparatus, wherein the computing apparatus comprises a configuration interface and S computing units, wherein S is an integer greater than 1;
the configuration interface is configured to transmit configuration information indicating positions of N regions of interest to N of the S computing units, where the N regions of interest are in one-to-one correspondence with the N computing units, and N is a positive integer less than or equal to S;
each of the N computing units is configured to pool the region of interest corresponding thereto, obtaining output data of the region of interest.
2. The computing device according to claim 1, wherein a first computing unit of the N computing units is configured to pool a first region of interest, obtaining output data of the first region of interest;
wherein, the pooling the first region of interest to obtain the output data of the first region of interest includes:
acquiring data of an input feature map, wherein the input feature map comprises K regions of interest, and K is a positive integer not less than N;
according to the position of the first region of interest and the resolution of the pooling output frame, obtaining a data window area corresponding to the data to be output of the first region of interest on the first region of interest;
selecting data falling into the data window area from the acquired data of the input feature map;
and carrying out operation processing on the data falling into the data window area to obtain the output data of the data window area.
3. The computing device of claim 2, wherein the performing the arithmetic processing on the data falling into the data window region to obtain the output data of the data window region comprises:
acquiring a column processing result of each row of data falling into the data window area;
and performing row processing on the column processing result to obtain output data of the data window area.
4. The computing device of claim 2, wherein the performing the arithmetic processing on the data falling into the data window region to obtain the output data of the data window region comprises:
if the first region of interest comprises a first data window region and a second data window region having a row overlapping region, caching a first column processing result of the row overlapping region in the process of acquiring output data of the first data window region;
in the process of calculating the output data of the second data window region, performing column processing on the data of the second data window region other than the row overlapping region to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window region.
5. The computing device of claim 2, wherein the performing the arithmetic processing on the data falling into the data window region to obtain the output data of the data window region comprises:
if the first region of interest comprises a first data window region and a second data window region having a row overlapping region, caching the row processing result of the first column processing results of the row overlapping region in the process of acquiring output data of the first data window region;
in the process of calculating the output data of the second data window region, performing column processing on the data in the second data window region other than the row overlapping region to obtain a second column processing result, performing row processing on the second column processing result, and processing the resulting row processing result together with the cached row processing result of the first column processing results of the row overlapping region to obtain the output data of the second data window region.
6. The computing device of any of claims 1-5, further comprising a data input interface;
wherein the configuration interface is further configured to transmit, to the data input interface, configuration information indicating a starting position of the input feature map in an external storage device and indicating a resolution of the input feature map;
the data input interface is configured to:
reading data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map;
broadcasting the read data of the input feature map to the N computing units.
7. The computing device of claim 6, further comprising a cache unit;
wherein the data input interface is configured to:
reading data of the input feature map from the external storage device in parallel in row-major order;
caching the data of the input feature map read in parallel into the cache unit;
performing parallel-serial conversion processing on the data of the input feature map in the cache unit;
and broadcasting the data of the input feature map obtained by the parallel-serial conversion processing to the N computing units.
8. The computing device of claim 6 or 7, further comprising:
a data output interface configured to output the output data calculated by the N calculation units to the external storage device.
9. The computing device of claim 8, further comprising:
and the arbitration unit is configured to sequentially transmit the output data obtained by calculation of the N calculation units to the data output interface according to a preset sequence.
10. The computing device of any of claims 6 to 9, wherein S is related to a granularity of data read by the data input interface and a number of pixels processed by the computing unit per clock cycle.
11. The computing device of claim 10, wherein the computing unit processes one pixel per clock cycle.
12. The computing device according to any one of claims 2 to 5, wherein the first computing unit includes:
and the operation module is configured to perform the operation processing on the data falling into the data window area to obtain the output data of the data window area.
13. The computing device of claim 12, wherein the number of operation modules is related to a width of the pooling output frame.
14. The computing device of any of claims 2 to 5, wherein the first computing unit further comprises:
a storage module configured to buffer the received data of the input feature map.
15. The computing apparatus of any of claims 1 to 14, wherein the configuration interface is configured to:
after the computing device completes the pooling of the N interesting regions, transmitting configuration information indicating positions of P interesting regions to P computing units in the S computing units, wherein the P interesting regions are in one-to-one correspondence with the P computing units, and P is a positive integer less than or equal to S;
the P regions of interest are regions of interest on the current input feature map which are not subjected to pooling processing, or the P regions of interest are regions of interest on the next input feature map.
16. The computing device of any one of claims 1 to 15, wherein the computing device is an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
17. A method of region-of-interest-pooling layer calculation, comprising:
acquiring configuration information indicating positions of N regions of interest on an input feature map, wherein N is a positive integer;
and performing parallel pooling processing on the N regions of interest according to the configuration information to obtain output data of the corresponding regions of interest.
18. The computing method of claim 17, wherein performing parallel pooling of the N regions of interest to obtain output data for respective regions of interest comprises:
pooling a first region of interest to obtain output data of the first region of interest;
wherein, the pooling the first region of interest to obtain the output data of the first region of interest includes:
acquiring data of an input feature map, wherein the input feature map comprises K regions of interest, and K is a positive integer not less than N;
according to the position of the first region of interest and the resolution of the pooling output frame, obtaining a data window area corresponding to the data to be output of the first region of interest on the first region of interest;
selecting data falling into the data window area from the acquired data of the input feature map;
and carrying out operation processing on the data falling into the data window area to obtain the output data of the data window area.
19. The computing method of claim 18, wherein the performing the operation on the data falling into the data window region to obtain the output data of the data window region comprises:
acquiring a column processing result of each row of data falling into the data window area;
and performing row processing on the column processing result to obtain output data of the data window area.
20. The computing method of claim 18, wherein the performing the operation on the data falling into the data window region to obtain the output data of the data window region comprises:
if the first region of interest comprises a first data window region and a second data window region having a row overlapping region, caching a first column processing result of the row overlapping region in the process of acquiring output data of the first data window region;
in the process of calculating the output data of the second data window region, performing column processing on the data of the second data window region other than the row overlapping region to obtain a second column processing result, and performing row processing on the second column processing result and the cached first column processing result to obtain the output data of the second data window region.
21. The computing method of claim 18, wherein the performing the operation on the data falling into the data window region to obtain the output data of the data window region comprises:
if the first region of interest comprises a first data window region and a second data window region having a row overlapping region, caching the row processing result of the first column processing results of the row overlapping region in the process of acquiring output data of the first data window region;
in the process of calculating the output data of the second data window region, performing column processing on the data in the second data window region other than the row overlapping region to obtain a second column processing result, performing row processing on the second column processing result, and processing the resulting row processing result together with the cached row processing result of the first column processing results of the row overlapping region to obtain the output data of the second data window region.
22. The computing method according to any one of claims 17 to 21, wherein the parallel pooling of the N regions of interest comprises:
performing parallel pooling processing on the N regions of interest through N computing units in a computing device comprising S computing units, wherein the N computing units are in one-to-one correspondence with the N regions of interest, S is an integer greater than 1, and N is less than or equal to S.
23. The computing method of claim 22, further comprising:
through the data input interface:
acquiring configuration information indicating a starting position of the input feature map in an external storage device and a resolution of the input feature map;
reading data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map;
broadcasting the read data of the input feature map to the N computing units.
24. The computing method of claim 23, wherein reading the data of the input feature map from the external storage device according to the starting position and the resolution of the input feature map comprises:
reading data of the input feature map from the external storage device in parallel in row-major order according to the starting position and the resolution of the input feature map;
the broadcasting the read data of the input feature map to the N calculation units includes:
caching the data of the input feature map read in parallel into a cache unit;
performing parallel-serial conversion processing on the data of the input feature map in the cache unit;
and broadcasting the data of the input feature map obtained by the parallel-serial conversion processing to the N computing units.
25. The computing method of any of claims 22 to 24, wherein S is related to a granularity of data read by the data input interface and a number of pixels processed by the computing unit per clock cycle.
26. The computing method of claim 25, wherein the computing unit processes one pixel per clock cycle.
27. The computing method according to any one of claims 22 to 26, wherein the computing unit comprises an operation module configured to perform operation processing on the data falling into respective data windows in the region of interest.
28. The computing method of claim 27, wherein the number of operation modules included in the computing unit is related to a width of the pooling output frame.
29. The computing method according to any one of claims 22 to 28, wherein the computing unit further comprises a caching module configured to cache the acquired data of the input feature map.
30. The computing method according to any one of claims 17 to 29, further comprising:
and outputting the output data of the N interested areas to an external storage device through a data output interface.
31. The computing method of claim 30, further comprising:
and sequentially transmitting the output data of the N interested areas to the data output interface according to a preset sequence through an arbitration unit.
32. The computing method according to any one of claims 17 to 31, wherein after completion of the pooling of the N regions of interest, the computing method further comprises:
acquiring configuration information indicating positions of P regions of interest, wherein the P regions of interest are regions of interest on the current input feature map that have not yet been pooled, or the P regions of interest are regions of interest on the next input feature map, and P is a positive integer.
33. The computing method according to any one of claims 22 to 29, wherein the computing device is an application specific integrated circuit, ASIC, or a field programmable gate array, FPGA.
34. A neural network system, comprising:
the region of interest-pooling layer computing device of any of claims 1-16.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/118933 WO2021092941A1 (en) | 2019-11-15 | 2019-11-15 | Roi-pooling layer computation method and device, and neural network system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112313673A true CN112313673A (en) | 2021-02-02 |
Family
ID=74336509
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201980039309.2A Pending CN112313673A (en) | 2019-11-15 | 2019-11-15 | Region-of-interest-pooling layer calculation method and device, and neural network system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112313673A (en) |
WO (1) | WO2021092941A1 (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108229645A * | 2017-04-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | Convolution acceleration and computation processing method, device, electronic equipment and storage medium |
US20180232629A1 (en) * | 2017-02-10 | 2018-08-16 | Kneron, Inc. | Pooling operation device and method for convolutional neural network |
CN110210490A (en) * | 2018-02-28 | 2019-09-06 | 深圳市腾讯计算机系统有限公司 | Image processing method, device, computer equipment and storage medium |
CN110383330A * | 2018-05-30 | 2019-10-25 | 深圳市大疆创新科技有限公司 | Pooling device and pooling method |
CN110399977A * | 2018-04-25 | 2019-11-01 | 华为技术有限公司 | Pooling operation device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102500836B1 (en) * | 2016-09-27 | 2023-02-16 | 한화테크윈 주식회사 | Method and apparatus for processing wide angle image |
2019
- 2019-11-15 WO PCT/CN2019/118933 patent/WO2021092941A1/en active Application Filing
- 2019-11-15 CN CN201980039309.2A patent/CN112313673A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180232629A1 (en) * | 2017-02-10 | 2018-08-16 | Kneron, Inc. | Pooling operation device and method for convolutional neural network |
CN108229645A * | 2017-04-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | Convolution acceleration and computation processing method, device, electronic equipment and storage medium |
CN110210490A (en) * | 2018-02-28 | 2019-09-06 | 深圳市腾讯计算机系统有限公司 | Image processing method, device, computer equipment and storage medium |
CN110399977A * | 2018-04-25 | 2019-11-01 | 华为技术有限公司 | Pooling operation device |
CN110383330A * | 2018-05-30 | 2019-10-25 | 深圳市大疆创新科技有限公司 | Pooling device and pooling method |
Also Published As
Publication number | Publication date |
---|---|
WO2021092941A1 (en) | 2021-05-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3496007B1 (en) | Device and method for executing neural network operation | |
US20200285446A1 (en) | Arithmetic device for neural network, chip, equipment and related method | |
US20210073569A1 (en) | Pooling device and pooling method | |
CN109214504B (en) | FPGA-based YOLO network forward reasoning accelerator design method | |
US11734554B2 (en) | Pooling processing method and system applied to convolutional neural network | |
US20200134435A1 (en) | Computation apparatus, circuit and relevant method for neural network | |
CN114501024B (en) | Video compression system, method, computer readable storage medium and server | |
CN115460414B (en) | Video compression method and system of baseboard management control chip and related components | |
US20220012587A1 (en) | Convolution operation method and convolution operation device | |
CN103793873A (en) | Obtaining method and device for image pixel mid value | |
CN109416743B (en) | Three-dimensional convolution device for identifying human actions | |
US10430339B2 (en) | Memory management method and apparatus | |
CN110490312B (en) | Pooling calculation method and circuit | |
CN112313673A (en) | Region-of-interest-pooling layer calculation method and device, and neural network system | |
WO2023142715A1 (en) | Video coding method and apparatus, real-time communication method and apparatus, device, and storage medium | |
CN116934573A (en) | Data reading and writing method, storage medium and electronic equipment | |
WO2023184754A1 (en) | Configurable real-time disparity point cloud computing apparatus and method | |
CN116912556A (en) | Picture classification method and device, electronic equipment and storage medium | |
CN111083479A (en) | Video frame prediction method and device and terminal equipment | |
WO2020107319A1 (en) | Image processing method and device, and video processor | |
CN115391053A (en) | Online service method and device based on CPU and GPU hybrid calculation | |
CN115499667B (en) | Video processing method, device, equipment and readable storage medium | |
CN111161135B (en) | Picture conversion method and related product | |
CN212873459U (en) | System for data compression storage | |
WO2023284479A1 (en) | Plane estimation method and apparatus, electronic device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20210202 |