CN114330635A - Device and method for scaling and accelerating data of neural network - Google Patents

Device and method for scaling and accelerating data of neural network

Info

Publication number
CN114330635A
Authority
CN
China
Prior art keywords
module
data
sampling
processing
parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011072023.5A
Other languages
Chinese (zh)
Inventor
刘敏丽
张楠赓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canaan Bright Sight Co Ltd
Original Assignee
Canaan Bright Sight Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canaan Bright Sight Co Ltd filed Critical Canaan Bright Sight Co Ltd
Priority to CN202011072023.5A priority Critical patent/CN114330635A/en
Publication of CN114330635A publication Critical patent/CN114330635A/en
Pending legal-status Critical Current

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a device and a method for scaling and accelerating data of a neural network. The acceleration device comprises: a control module, configured to acquire a control instruction, parse the control instruction to obtain a processing instruction, and determine a target processing module from at least two processing modules based on the processing instruction, the at least two processing modules comprising an up-sampling module and a down-sampling module; an internal cache module, configured to cache the feature map data obtained by convolution calculation; a data reading module, configured to read a first feature map in the feature map data from the internal cache module to the target processing module based on the processing instruction; the target processing module, configured to process the first feature map based on the processing instruction to obtain a processing result; and a data writing-out module, configured to write the processing result back to the internal cache module based on the processing instruction. With this device and method, the down-sampling and up-sampling functions of a convolutional neural network are implemented in a combined manner, the two functions multiplex the same data reading and writing logic, the occupied chip area is small, and the power consumption is low.

Description

Device and method for scaling and accelerating data of neural network
Technical Field
The invention belongs to the technical field of neural networks, and particularly relates to a device and a method for accelerating data scaling of a neural network.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
A Convolutional Neural Network (CNN) is a deep feedforward artificial neural network that has been applied in many fields, such as image recognition. When processing image data, a convolutional neural network performs relatively complex convolution calculations; after the convolution operation, a down-sampling operation may be performed to reduce the size of the image, or an up-sampling operation may be performed to enlarge it.
In the prior art, there are two typical ways to implement special operations in a convolutional neural network such as the above down-sampling and up-sampling operations. One is to implement the down-sampling and up-sampling operations with separate hardware for each operation; however, dedicating hardware to every operation increases the chip area and therefore the production cost, and such hardware is difficult to configure freely as required, for example, only 2x2/3x3/4x4 down-sampling or up-sampling with a magnification of 2/4/8/16 can be implemented. The other is to call existing chips such as a CPU or a GPU to implement the above operations indirectly; however, a CPU or GPU is a general-purpose hardware accelerator that is not specially designed for operations in a neural network, so its execution efficiency is low, and the communication with the CPU or GPU further degrades timeliness, resulting in low performance of the whole chip.
Disclosure of Invention
To address the poor implementation of the up-sampling and down-sampling operations of a convolutional neural network in the prior art, embodiments of the invention provide a device and a method for accelerating data scaling of a neural network. With such an acceleration device and method, the above problems can be solved.
The following solutions are provided in embodiments of the present invention.
In a first aspect, an apparatus for scaling and accelerating data in a neural network is provided, including: a control module, configured to acquire a control instruction, parse the control instruction to obtain a processing instruction, and determine a target processing module from at least two processing modules based on the processing instruction, the at least two processing modules including an upsampling module and a downsampling module; an internal cache module, configured to cache the feature map data obtained by convolution calculation; a data reading module, configured to read a first feature map in the feature map data from the internal cache module to the target processing module based on the processing instruction; the target processing module, configured to process the first feature map based on the processing instruction to obtain a processing result; and a data writing-out module, configured to write the processing result back to the internal cache module based on the processing instruction.
In some embodiments, the at least two processing modules further include the data writing-out module, and when the target processing module is the data writing-out module, the data reading module is further configured to read out a second feature map from the feature map data cached in the internal cache module according to the predetermined dimensional sequence based on the processing instruction, and input the second feature map to the data writing-out module; and the data writing-out module is also used for sequentially writing the second characteristic diagram back to the internal cache module.
In some embodiments, an internal cache module comprising a plurality of memory regions accessible in parallel; and the data reading module comprises a plurality of groups of reading logic circuits which are arranged in parallel and is used for reading the required characteristic diagram data from a plurality of storage areas of the internal cache module in parallel and inputting the characteristic diagram data into the down-sampling module or the up-sampling module or the data writing-out module in parallel.
In some embodiments, the down-sampling module comprises a plurality of computing units arranged in parallel, the plurality of computing units are connected to the plurality of groups of reading logic circuits in parallel, and are used for receiving the first feature maps in parallel and executing multi-channel down-sampling operation in parallel; each computing unit is used for performing width-direction down-sampling operation on the read data, and the parallel computing units are used for performing height-direction down-sampling operation on the width-direction down-sampling result.
In some embodiments, each computing unit of the down-sampling module includes a first-in-first-out memory for storing the source pixels of one down-sampling window width for each row in a single channel.
In some embodiments, when the target processing unit is the down-sampling module, the multiple sets of read logic circuits of the data reading module send multiple read requests in parallel to the multiple storage areas in response to the processing instruction, where each read request carries the number of the corresponding computing unit; if multiple computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, with a lower number having a higher priority; the arbitrated read requests are sent to the internal cache module in sequence, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
In some embodiments, the down-sampling operation comprises: maximum downsampling, minimum downsampling, sum downsampling, and mean downsampling of any window size.
In some embodiments, the upsampling operation performed by the upsampling module comprises: nearest neighbor interpolation or bilinear interpolation.
In some embodiments, the upsampling module is further configured to compatibly implement the ROI align algorithm in the MaskRCNN network.
In a second aspect, a method for scaling and accelerating data in a neural network is provided, which is performed by the apparatus for scaling and accelerating data in a neural network of the first aspect, and the method includes: acquiring a control instruction through a control module, parsing the control instruction to obtain a processing instruction, and determining a target processing module from at least two processing modules based on the processing instruction, the at least two processing modules including an up-sampling module and a down-sampling module; caching, through an internal cache module, the feature map data obtained by convolution calculation; reading, through a data reading module and based on the processing instruction, a first feature map in the feature map data from the internal cache module to the target processing module; processing the first feature map through the target processing module based on the processing instruction to obtain a processing result; and writing the processing result back to the internal cache module through a data writing-out module based on the processing instruction.
In some embodiments, the at least two processing modules further comprise the data writing-out module, and when the target processing module is the data writing-out module, the method further comprises: reading, through the data reading module and based on the processing instruction, a second feature map from the feature map data cached in the internal cache module according to a predetermined dimension order, and inputting the second feature map into the data writing-out module; and sequentially writing, through the data writing-out module and based on the processing instruction, the second feature map back to the internal cache module.
In some embodiments, the internal cache module comprises a plurality of storage areas accessible in parallel, and the data reading module comprises multiple sets of read logic circuits arranged in parallel; the method further comprises: reading the feature map data in parallel from the plurality of storage areas of the internal cache module through the multiple sets of read logic circuits of the data reading module, and inputting it in parallel into the down-sampling module, the up-sampling module or the data writing-out module.
In some embodiments, the down-sampling module comprises a plurality of computing units arranged in parallel, the plurality of computing units being connected in parallel to the multiple sets of read logic circuits; the method further comprises: the plurality of computing units receive the first feature map in parallel through the multiple sets of read logic circuits and perform a multi-channel down-sampling operation in parallel; each computing unit performs a width-direction down-sampling operation on the feature map it reads, and the parallel computing units perform a height-direction down-sampling operation on the width-direction down-sampling results.
In some embodiments, each computing unit of the downsampling module includes a first-in-first-out memory to store source pixels of the downsampling window width size for each row in a single channel.
In some embodiments, the multiple sets of read logic circuits of the data reading module send multiple read requests in parallel to the multiple storage areas in response to the down-sampling processing instruction, where each read request carries the number of the corresponding computing unit; if multiple computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, with a lower number having a higher priority; the arbitrated read requests are sent to the internal cache module in sequence, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
In some embodiments, the down-sampling operation comprises: maximum downsampling, minimum downsampling, sum downsampling, and mean downsampling of any window size.
In some embodiments, the upsampling operation performed by the upsampling module comprises: nearest neighbor interpolation or bilinear interpolation.
In some embodiments, the method further comprises: compatibly implementing, through the up-sampling module, the ROI align algorithm in the MaskRCNN network.
The embodiments of the present application adopt at least one technical solution that can achieve the following beneficial effects: the down-sampling (also called pooling) and up-sampling functions of the convolutional neural network are implemented in a combined manner; the two functions multiplex the same data reading and writing logic, so the occupied chip area is small and the power consumption is low.
It should be understood that the above is only an overview of the technical solutions of the present invention, provided so that the technical means of the present invention can be understood clearly and implemented according to the content of the description. In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will be apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings are only for purposes of illustrating exemplary embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like elements throughout. In the drawings:
FIG. 1 is a schematic diagram of an apparatus for accelerating data scaling of a neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an internal cache unit according to an embodiment of the present invention;
FIG. 3 is a block diagram of a down-sampling module according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an upsampling operation according to an embodiment of the present invention;
fig. 5 is a flowchart illustrating a method for accelerating data scaling of a neural network according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the present invention, it is to be understood that terms such as "including" or "having," or the like, are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility of the presence of one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 is a schematic structural diagram of an apparatus 100 for accelerating data scaling of a neural network according to an embodiment of the present application.
As shown in fig. 1, an apparatus 100 for accelerating data scaling of a neural network includes: a control module, an internal buffer module 101, a data reading module 102, at least two processing modules (including at least a down-sampling module 103, an up-sampling module 104), and a data writing-out module 105.
The control module is used for acquiring a control instruction, parsing the control instruction to obtain a processing instruction, and determining a target processing module from at least two processing modules based on the processing instruction, wherein the at least two processing modules comprise an up-sampling module and a down-sampling module. Specifically, the processing instruction obtained by parsing may be a down-sampling processing instruction or an up-sampling processing instruction; when the processing instruction is a down-sampling processing instruction, the target processing module is the down-sampling module, and when the processing instruction is an up-sampling processing instruction, the target processing module is the up-sampling module.
An internal cache module 101, configured to cache the feature map data obtained by convolution calculation. A feature map is the result of a convolution calculation and has multiple dimensions, such as W (width), H (height), C (channel), and N (number of frames).
And the data reading module 102 is configured to read a first feature map in the feature map data from the internal cache module to the determined target processing module based on the processing instruction. When the processing instruction is a down-sampling instruction, the first feature map selected from the feature map data is the feature map to be down-sampled indicated by the processing instruction, and the feature map to be down-sampled may be read from the internal cache module 101 to the down-sampling module 103 based on the processing instruction. When the processing instruction is an upsampling instruction, the first feature map selected from the feature map data is the feature map to be upsampled indicated by the processing instruction, and the required feature map to be upsampled can be read from the internal cache module 101 to the upsampling module 104 based on the processing instruction.
And the target processing module is used for processing the first feature map based on the processing instruction to obtain a processing result.
When the processing instruction is a down-sampling instruction, the first feature map is the feature map to be down-sampled; the down-sampling module 103 performs the down-sampling operation on the first feature map and, after the operation is completed, inputs the processing result into the data writing-out module 105. When the processing instruction is an up-sampling instruction, the first feature map is the feature map to be up-sampled; the up-sampling module 104 performs the up-sampling operation on the first feature map and, after the operation is completed, inputs the processing result into the data writing-out module 105.
And the data writing-out module 105 is used for writing the processing result back to the internal cache module 101.
In this embodiment, the down-sampling (also called pooling) and up-sampling functions of the convolutional neural network are implemented in a combined manner. When a down-sampling operation needs to be performed, the down-sampling module is started by a down-sampling processing instruction, the data reading module 102 is controlled to read the feature map to be down-sampled from the internal cache module 101 and transmit it to the down-sampling module, and after the down-sampling operation is performed, the data writing-out module 105 is controlled to write the result back to the internal cache module 101. When an up-sampling operation needs to be performed, the up-sampling module is started by an up-sampling processing instruction, the data reading module 102 is controlled to read the feature map to be up-sampled from the internal cache module 101 and transmit it to the up-sampling module, and after the up-sampling operation is performed, the data writing-out module 105 is controlled to write the result back to the internal cache module 101. Both functions can thus be realized by configuring the functional modules of the acceleration device with a down-sampling or up-sampling processing instruction. That is, the down-sampling and up-sampling functions multiplex the same data reading and writing logic, so the occupied chip area is small and the power consumption is relatively low.
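This dispatch can be illustrated with a minimal functional sketch (plain Python, not the patent's hardware): a single shared read path and a single shared write-back path serve both sampling directions, selected by the parsed processing instruction. The function names, the dict-based cache and the fixed 2x2 window / 2x scale are illustrative assumptions.

```python
# Minimal functional sketch: one shared read path and one shared write-back
# path serve both sampling directions, selected by the parsed instruction.

def max_pool_2x2(fm):
    # down-sampling example: 2x2 max pooling, stride 2
    h, w = len(fm), len(fm[0])
    return [[max(fm[i][j], fm[i][j + 1], fm[i + 1][j], fm[i + 1][j + 1])
             for j in range(0, w - 1, 2)] for i in range(0, h - 1, 2)]

def nearest_upsample_2x(fm):
    # up-sampling example: nearest-neighbor, scale factor 2
    return [[v for v in row for _ in range(2)] for row in fm for _ in range(2)]

def run(instruction, cache):
    fm = cache[instruction["src"]]                  # shared read logic
    if instruction["op"] == "downsample":
        result = max_pool_2x2(fm)
    else:                                           # "upsample"
        result = nearest_upsample_2x(fm)
    cache[instruction["dst"]] = result              # shared write-out logic
    return result

cache = {"fm0": [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]}
print(run({"op": "downsample", "src": "fm0", "dst": "fm1"}, cache))  # [[6, 8], [14, 16]]
```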
In a possible implementation, the at least two processing modules further include the data writing-out module 105, and the target processing module may also be the data writing-out module 105. When the processing instruction is a transpose instruction, the target processing module is the data writing-out module 105. In this case, the data reading module 102, in response to the processing instruction for data transposition, reads out a second feature map from the feature map data cached in the internal cache module 101 in a jumping manner according to a predetermined dimension order (the second feature map being the data to be transposed indicated by the processing instruction) and inputs it into the data writing-out module 105; accordingly, the data writing-out module 105 may be further configured to write the data to be transposed back to the internal cache module 101 in sequence in response to the processing instruction for data transposition. In this way, the transpose function of the convolutional neural network can be implemented in combination with the down-sampling and up-sampling functions.
In one possible implementation, the internal cache module 101 may include a plurality of independent storage areas (banks) accessible in parallel, as shown in fig. 2, for example 8 storage areas numbered bank0, bank1, …, bank7. The data reading module 102 includes multiple sets of read logic circuits arranged in parallel, configured to read the required feature map data (the first feature map or the second feature map) in parallel from the multiple storage areas of the internal cache module 101 and to input it in parallel into the down-sampling module 103, the up-sampling module 104 or the data writing-out module 105. Each set of read logic circuits can independently read the required data from any one of the independent storage areas and transmit it to the down-sampling module 103, the up-sampling module 104 or the data writing-out module 105. In this way, the required data can be read from the internal cache module in parallel, giving a larger read bandwidth and a higher read speed.
In one example, when implementing the transpose function, data is read from the internal cache module 101 in a jumping manner and written back to the internal cache module 101 sequentially. Specifically, assuming the input uses an NCHW layout (where W is the width dimension, H the height dimension, C the channel dimension, and N the frame-number dimension) and the output uses an NHWC layout, the first 8 sets of read logic circuits in the data reading module 102 may be reused: data is first read along the C dimension, one point (each point in bf16 format) per channel; when the first point of all channels has been read, reading moves along the W dimension, then along the H dimension, and finally along the N dimension. It can be understood that the output may also adopt other layouts besides NHWC, which is not specifically limited in the present application. Assuming that each storage area (bank) of the internal cache module has a bit width of 128 bits, corresponding to 8 pixels in the bf16 data format, then when implementing the transpose function, after the first 8 sets of read logic circuits in the data reading module 102 have read the data, 8 pixels can be spliced into one data word and written into the internal cache module in row order. In short, the transpose operation reads the internal cache module in the traversal order of the target NHWC layout and writes the results back sequentially.
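The jump-read order described above can be sketched as follows, assuming an NCHW-contiguous source whose elements are visited in the order of the target NHWC layout and written back sequentially; the address arithmetic is an illustrative assumption, not the actual read-logic implementation.

```python
# Sketch of the NCHW -> NHWC jump-read order: the C dimension is traversed
# innermost (one point per channel), then W, then H, then N.

def nchw_to_nhwc_read_order(n, c, h, w):
    """Yield linear NCHW source addresses in the order they are read."""
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):              # innermost: jump across channels
                    yield ((ni * c + ci) * h + hi) * w + wi

src = list(range(2 * 3 * 2 * 2))                 # flat NCHW tensor, N=2, C=3, H=W=2
dst = [src[addr] for addr in nchw_to_nhwc_read_order(2, 3, 2, 2)]
# dst now holds the same tensor in NHWC order and can be written back sequentially
print(dst[:6])   # [0, 4, 8, 1, 5, 9]: first point of each channel, then the next point
```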
In one embodiment, the down-sampling module 103 may include a plurality of computing units arranged in parallel, the plurality of computing units being connected to the plurality of sets of read logic circuits in parallel, for receiving the required feature maps in parallel and performing a multi-channel down-sampling operation in parallel; each computing unit is used for performing width direction down-sampling operation on the read feature map, and the parallel computing units are used for performing height direction down-sampling operation on the width direction down-sampling result.
In one example, referring to fig. 2, which shows the internal structure of the internal cache module 101, it may include 8 independent RAMs, each with a depth of 512 and a width of 128 bits, numbered bank0, bank1, …, bank7. Referring to fig. 3, which shows the internal structure of the down-sampling module 103, it may include 16 computing units PE0 to PE15 arranged in parallel and can implement a multi-channel down-sampling operation of arbitrary window size in parallel. Assume a down-sampling with a window size of 3 × 3, a stride of 1 × 1, and zero padding rows/columns on all four sides is to be implemented. In this case the 16 computing units can process 5 channels in parallel: PE0 to PE2 take line0 to line2 of channel 0 as input, PE3 to PE5 take line0 to line2 of channel 1, and so on. Taking maximum or minimum down-sampling as an example, the operation proceeds as follows: after PE0 obtains its first row-direction result, that result is compared with the first row-direction result of PE1, and the outcome is compared with the first row-direction result of PE2; this process repeats, and after all three rows of data have been processed, PE0 to PE2 take line1 to line3 of channel 0 as input, PE3 to PE5 take line1 to line3 of channel 1, and so on.
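A functional sketch of the 3 × 3, stride 1 example above (assumed behaviour, not the RTL): each PE reduces its own input row along the width direction, and the three PEs assigned to one channel are then reduced along the height direction. The helper names are hypothetical.

```python
# Sketch of the PE cooperation for one channel of a 3x3 / stride 1 window.

def pe_row_reduce(line, window=3, stride=1, op=max):
    """One PE: width-direction down-sampling of a single row."""
    return [op(line[j:j + window]) for j in range(0, len(line) - window + 1, stride)]

def channel_downsample(line0, line1, line2, op=max):
    """Three vertically chained PEs: height-direction reduction of the row results."""
    r0, r1, r2 = (pe_row_reduce(l, op=op) for l in (line0, line1, line2))
    return [op(a, b, c) for a, b, c in zip(r0, r1, r2)]

# channel 0, first three rows (line0..line2) of a feature map
print(channel_downsample([1, 5, 2, 8], [3, 0, 7, 1], [4, 6, 2, 9]))  # [7, 9]
```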
In this embodiment, the window size, the step value, and the number of rows and columns of the boundary padding (padding) of the down-sampling operation can be freely configured, and the hardware dynamically adjusts the read address of the internal cache module, the input data of the computing unit, the computing execution sequence, and the like according to the configuration parameters.
In one possible implementation, each computing unit includes a first-in-first-out (FIFO) memory for storing the source pixels of one down-sampling window width for each row in a single channel.
In an example, taking maximum or minimum down-sampling as an example, each computing unit can process one source pixel of the feature map per clock cycle, performing the down-sampling in the row direction; the result for each point is then passed in turn to the next computing unit in the vertical direction and compared with that unit's result to implement the column-direction down-sampling. The computations chained in the vertical direction must belong to the same channel so that their results can be compared with each other.
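A sketch of the per-PE pipeline just described, under the assumption of a stride of 1: a small FIFO holds the last window-width source pixels of the current row, so one new source pixel per clock cycle is enough to emit one width-direction result, which would then be passed to the next PE for the height-direction comparison.

```python
# Sketch of one PE processing a stream of source pixels, one per "cycle".
from collections import deque

def pe_stream(pixels, window=3, op=max):
    fifo = deque(maxlen=window)             # FIFO of window-width source pixels
    for p in pixels:                        # one source pixel per clock cycle
        fifo.append(p)
        if len(fifo) == window:
            yield op(fifo)                  # one width-direction result per cycle

print(list(pe_stream([1, 5, 2, 8, 3])))     # [5, 8, 8]
```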
In one possible embodiment, the down-sampling operation includes any one of maximum down-sampling, minimum down-sampling, sum down-sampling, and mean down-sampling.
When the down-sampling operation is sum down-sampling, the comparison operation in the above maximum/minimum down-sampling example is replaced by a summation operation. When the down-sampling operation is mean down-sampling, an averaging step is performed after the last computing unit of each channel finishes its calculation, using the formula: mean = (sum of the valid points in the window) / (total number of valid points in the window). The reciprocal of the number of valid points can be obtained by table lookup and then multiplied directly with the summation result.
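A sketch of the mean down-sampling step, assuming the hardware keeps a small reciprocal table indexed by the number of valid points so that the final division becomes a single multiplication; the table size and names are illustrative.

```python
# Sketch of mean pooling via a reciprocal lookup table.
MAX_WINDOW_POINTS = 16                      # e.g. up to a 4x4 window
RECIPROCAL_LUT = [0.0] + [1.0 / n for n in range(1, MAX_WINDOW_POINTS + 1)]

def mean_downsample_window(valid_points):
    total = sum(valid_points)               # produced by the sum-downsampling path
    return total * RECIPROCAL_LUT[len(valid_points)]   # multiply by 1/N from the LUT

print(mean_downsample_window([2.0, 4.0, 6.0]))   # 4.0
```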
In one possible implementation, the multiple sets of read logic circuits of the data reading module send multiple read requests in parallel to the multiple storage areas in response to the down-sampling processing instruction, where each read request carries the number of the corresponding computing unit; if multiple computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, with a lower number having a higher priority; the arbitrated read requests are sent to the internal cache module in sequence, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
In an example, as shown in fig. 3, assume a down-sampling with a window size of 3 × 3, a stride of 1 × 1 and zero padding rows/columns on all four sides is to be implemented. Since the down-sampling module includes 16 PE units, numbered PE0 to PE15 in order, the 16 PE units may need to read data from the internal cache module at the same time, that is, 16 read requests need to be mapped from the data reading module onto the 8 storage areas (banks) of the internal cache module. The mapping is as follows, taking the PE partition in fig. 3 as an example: assume the Tensor slice address required by PE0 is addr0; the address required by PE1 is addr1 = addr0 + stride_w (stride_w denotes the address space occupied by one row of the W dimension of the Tensor slice); the address required by PE2 is addr2 = addr0 + 2 * stride_w; the address required by PE3 is addr3 = addr0 + stride_c (stride_c denotes the address space occupied by one channel of the C dimension of the Tensor slice); the address required by PE4 is addr4 = addr0 + stride_c + stride_w; and so on, giving the Tensor slice addresses of the remaining computing units. From this it can be calculated that PE0 reads storage area (bank) number mod(addr0, 8) of the internal cache module at read address addr0/8; PE1 reads bank number mod(addr1, 8) at read address addr1/8; and so on. If PE0 and PE1 read the same storage area (bank) at a given moment, the read request corresponding to PE0 is mapped to that bank first, and the read request corresponding to PE1 afterwards. That is, when multiple PEs read one storage area (bank) simultaneously, the PE with the smaller number receives a response first and the PE with the larger number later. The choice of 16 PE units in the down-sampling module matches the 8 storage areas (banks) of the internal cache module, which guarantees a maximum read bandwidth of 8 × 128 bits. The 8 storage areas (banks) of the internal cache module are designed to also satisfy other functions of the neural network accelerator; if only the down-sampling requirement were considered, they could be replaced by 16 storage areas (banks), and the number can be expanded or reduced according to actual requirements.
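The address-to-bank mapping and the fixed-priority arbitration can be sketched as follows (8 banks, lower PE number served first); the request-tuple layout and the example stride values are assumptions for illustration only.

```python
# Sketch of the bank mapping (bank = addr mod 8, row = addr div 8) and arbitration.
NUM_BANKS = 8

def map_request(pe_id, addr):
    """A PE read request maps to (PE number, bank number, address inside the bank)."""
    return pe_id, addr % NUM_BANKS, addr // NUM_BANKS

def arbitrate(requests):
    """Group requests per bank; within a bank the lowest PE number is served first."""
    per_bank = {b: [] for b in range(NUM_BANKS)}
    for pe_id, bank, row in sorted(requests):        # sorted => PE0 before PE1 before ...
        per_bank[bank].append((pe_id, row))
    return per_bank

# PE0..PE3 reading line0/line1/line2 of channel 0 and line0 of channel 1
# (stride_w = 4, stride_c = 12 are assumed example values)
addr0, stride_w, stride_c = 0, 4, 12
reqs = [map_request(0, addr0),
        map_request(1, addr0 + stride_w),
        map_request(2, addr0 + 2 * stride_w),
        map_request(3, addr0 + stride_c)]
print(arbitrate(reqs))   # PE0 and PE2 both map to bank 0; PE0 is queued first
```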
The down-sampling module 103 supports the bf16 input data format, and thus the comparator, adder, and multiplier therein are all floating point type.
In one possible implementation, the upsampling operation includes: nearest neighbor interpolation or bilinear interpolation.
In one example, the up-sampling module 104 may multiplex the first 4 sets of read logic circuits of the data reading module to read the pixel values at the image coordinates (x0, y0), (x0+1, y0), (x0, y0+1), (x0+1, y0+1) (where (x0, y0) is the integer source coordinate) from the internal cache module, and then apply the nearest-neighbor or bilinear interpolation formula.
The nearest-neighbor interpolation formula is:
f = (x<0.5 && y<0.5)*a + (x>=0.5 && y<0.5)*b + (x<0.5 && y>=0.5)*c + (x>=0.5 && y>=0.5)*d
the bilinear interpolation formula is:
f=(1-x)(1-y)*a+x(1-y)*b+y(1-x)*c+xy*d
As shown in fig. 4, in the above two formulas, a is the pixel value at coordinates (x0, y0), b the pixel value at (x0+1, y0), c the pixel value at (x0, y0+1), and d the pixel value at (x0+1, y0+1); the meaning of x and y is illustrated in fig. 4, and f is the pixel value of the interpolated target point. The up-sampling module sends the computed target point f to the data writing-out module, and the writing-out module writes it into the internal cache module.
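The two formulas can be written directly as a sketch, with a, b, c, d the four neighbouring pixel values and x, y the fractional offsets of the target point inside that 2 × 2 neighbourhood (as in fig. 4):

```python
# Sketch of the nearest-neighbor and bilinear interpolation formulas above.

def nearest_interp(a, b, c, d, x, y):
    return ((x < 0.5 and y < 0.5) * a + (x >= 0.5 and y < 0.5) * b +
            (x < 0.5 and y >= 0.5) * c + (x >= 0.5 and y >= 0.5) * d)

def bilinear_interp(a, b, c, d, x, y):
    return (1 - x) * (1 - y) * a + x * (1 - y) * b + y * (1 - x) * c + x * y * d

print(nearest_interp(10, 20, 30, 40, 0.7, 0.2))   # 20 (closest neighbour is b)
print(bilinear_interp(10, 20, 30, 40, 0.5, 0.5))  # 25.0
```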
In one possible implementation, the upsampling module 104 may also be configured to compatibly implement the ROI align algorithm in the MaskRCNN network.
In an example, in order to be compatible with the ROI align algorithm in the MaskRCNN network, the upsampling module 104 needs to support reading the four pixel values at image coordinates (x0, y0), (x0+delta_x, y0), (x0, y0+delta_y), (x0+delta_x, y0+delta_y) from the internal cache module, where delta_x and delta_y are the spans between source pixels in the horizontal and vertical directions, respectively, and can be calculated from the ratio of the source image size to the target image size (e.g. source image width / target image width).
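A sketch of this ROI-align-style addressing, assuming the spans are derived from the source/target size ratio; the function name and arguments are hypothetical.

```python
# Sketch: derive delta_x / delta_y from the size ratio and list the four
# neighbour coordinates to be fetched for one sample point.

def roi_sample_coords(x0, y0, src_w, src_h, dst_w, dst_h):
    delta_x = src_w / dst_w                 # span between source pixels, horizontal
    delta_y = src_h / dst_h                 # span between source pixels, vertical
    return [(x0, y0), (x0 + delta_x, y0), (x0, y0 + delta_y), (x0 + delta_x, y0 + delta_y)]

print(roi_sample_coords(2.0, 3.0, src_w=16, src_h=16, dst_w=4, dst_h=8))
# [(2.0, 3.0), (6.0, 3.0), (2.0, 5.0), (6.0, 5.0)]
```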
In a possible implementation manner, based on the same technical concept, an embodiment of the present invention further provides a method 500 for accelerating data scaling of a neural network, where the method 500 is performed by the apparatus 100 for accelerating data scaling of a neural network shown in fig. 1.
As shown in fig. 5, the method 500 includes:
step 501: acquiring a control instruction through a control module, analyzing the control instruction to obtain a processing instruction, and determining a target processing module from at least two processing modules based on the processing instruction, wherein the at least two processing modules comprise a down-sampling module and an up-sampling module;
step 502: caching, through an internal cache module, the feature map data obtained by convolution calculation;
step 503: reading a first feature map in the feature map data from the internal cache module to the target processing module through a data reading module and based on the processing instruction;
step 504: processing the first feature map through the target processing module based on the processing instruction to obtain a processing result;
step 505: and writing the processing result back to the internal cache module through the data writing-out module based on the processing instruction.
In some embodiments, the at least two processing modules further comprise the data writing-out module, and when the target processing module is the data writing-out module, the method further comprises: reading, through the data reading module and based on the processing instruction, a second feature map from the internal cache module according to a predetermined dimension order, and inputting the second feature map into the data writing-out module; and sequentially writing, through the data writing-out module and based on the processing instruction, the second feature map back to the internal cache module.
In some embodiments, the internal cache module comprises a plurality of storage areas accessible in parallel, and the data reading module comprises multiple sets of read logic circuits arranged in parallel; the method further comprises: reading the feature map data in parallel from the plurality of storage areas of the internal cache module through the multiple sets of read logic circuits of the data reading module, and inputting it in parallel into the down-sampling module, the up-sampling module or the data writing-out module.
In some embodiments, the down-sampling module comprises a plurality of computing units arranged in parallel, the plurality of computing units being connected in parallel to the multiple sets of read logic circuits; the method further comprises: the plurality of computing units receive the first feature map in the feature map data in parallel through the multiple sets of read logic circuits and perform a multi-channel down-sampling operation in parallel; each computing unit performs a width-direction down-sampling operation on the feature map it reads, and the parallel computing units perform a height-direction down-sampling operation on the width-direction down-sampling results.
In some embodiments, each computing unit of the downsampling module includes a first-in-first-out memory to store source pixels of the downsampling window width size for each row in a single channel.
In some embodiments, the multiple sets of read logic circuits of the data reading module send multiple read requests in parallel to the multiple storage areas in response to the down-sampling processing instruction, where each read request carries the number of the corresponding computing unit; if multiple computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, with a lower number having a higher priority; the arbitrated read requests are sent to the internal cache module in sequence, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
In some embodiments, the down-sampling operation comprises: maximum downsampling, minimum downsampling, sum downsampling, and mean downsampling of any window size.
In some embodiments, the upsampling operation performed by the upsampling module comprises: nearest neighbor interpolation or bilinear interpolation.
In some embodiments, the method further comprises: compatibly implementing, through the up-sampling module, the ROI align algorithm in the MaskRCNN network.
It should be noted that the acceleration method in the embodiments of the present application corresponds one-to-one to the aspects of the aforementioned acceleration device embodiments and achieves the same effects and functions, so the details are not repeated here.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is for convenience of description only and does not imply that features in those aspects cannot be combined to advantage. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (18)

1. An apparatus for scaling acceleration of data in a neural network, comprising:
the control module is used for acquiring a control instruction, analyzing the control instruction to obtain a processing instruction, and determining a target processing module from at least two processing modules based on the processing instruction, wherein the at least two processing modules comprise an up-sampling module and a down-sampling module;
the internal cache module is used for caching the feature map data obtained by the convolution calculation;
a data reading module, configured to read a first feature map in the feature map data from the internal cache module to the target processing module based on the processing instruction;
the target processing module is used for processing the first feature map based on the processing instruction to obtain a processing result;
and the data writing-out module is used for writing the processing result back to the internal cache module based on the processing instruction.
2. The apparatus of claim 1, wherein the at least two processing modules further comprise the data write-out module, and wherein when the target processing module is the data write-out module,
the data reading module is further configured to read a second feature map from the feature map data cached in the internal cache module according to a predetermined dimension order based on the processing instruction, and input the second feature map to the data writing module;
the data writing-out module is further configured to write the second feature maps back to the internal cache module in sequence.
3. The apparatus of claim 2,
the internal cache module comprises a plurality of storage areas which can be accessed in parallel;
the data reading module comprises multiple sets of read logic circuits arranged in parallel, configured to read required feature map data in parallel from the plurality of storage areas of the internal cache module and to input the feature map data in parallel into the down-sampling module, the up-sampling module or the data writing-out module.
4. The apparatus of claim 3,
the down-sampling module comprises a plurality of computing units arranged in parallel, the computing units being connected in parallel to the multiple sets of read logic circuits and configured to receive the first feature map in parallel and to perform a multi-channel down-sampling operation in parallel;
each computing unit is used for performing width-direction down-sampling operation on read data, and the parallel computing units are used for performing height-direction down-sampling operation on the width-direction down-sampling result.
5. The apparatus of claim 4, wherein each computing unit of the downsampling module comprises a first-in-first-out memory for storing source pixels of a downsampling window width for each row in a single channel.
6. The apparatus of claim 4,
when the target processing unit is the down-sampling module, the multiple groups of read logic circuits of the data reading module respond to the processing instruction and send multiple read requests to the multiple storage areas in parallel, wherein each read request carries the number of a corresponding computing unit;
if a plurality of computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, a lower number corresponding to a higher priority; the arbitrated read requests are sequentially sent to the internal cache module, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
7. The apparatus of claim 1, wherein the down-sampling operation comprises: maximum downsampling, minimum downsampling, sum downsampling, and mean downsampling of any window size.
8. The apparatus of claim 1, wherein the upsampling module performs the upsampling operation comprising: nearest neighbor interpolation or bilinear interpolation.
9. The apparatus of claim 1, wherein the upsampling module is further configured to: compatibly implement the ROI align algorithm in the MaskRCNN network.
10. A method for neural network data scaling acceleration, wherein the method is performed by the apparatus for neural network data scaling acceleration of any one of claims 1-9, the method comprising:
acquiring a control instruction through a control module, analyzing the control instruction to obtain a processing instruction, and determining a target processing module from at least two processing modules based on the processing instruction, wherein the at least two processing modules comprise a down-sampling module and an up-sampling module;
obtaining characteristic diagram data of the cached convolution calculation through an internal cache module;
reading a first feature map in the feature map data from the internal cache module to the target processing module through a data reading module and based on the processing instruction;
processing the first feature map through the target processing module based on the processing instruction to obtain a processing result;
and writing the processing result back to the internal cache module through the data writing-out module based on the processing instruction.
11. The method of claim 10, wherein the at least two processing modules further comprise the data write-out module, and wherein when the target processing module is the data write-out module, the method further comprises:
reading, through the data reading module and based on the processing instruction, a second feature map in the feature map data from the internal cache module according to a predetermined dimension order, and inputting the second feature map into the data writing-out module;
sequentially writing back, by the data write-out module, the second feature map to the internal cache module based on the processing instruction.
12. The method of claim 11, wherein the internal cache module comprises a plurality of memory regions accessible in parallel; the data reading module comprises a plurality of groups of reading logic circuits which are arranged in parallel, and the method further comprises the following steps:
and reading the feature map data in parallel from the plurality of storage areas of the internal cache module through the multiple sets of read logic circuits of the data reading module, and inputting it in parallel into the down-sampling module, the up-sampling module or the data writing-out module.
13. The method of claim 12, wherein the down-sampling module comprises a plurality of computational units arranged in parallel; the plurality of compute units connected in parallel to the plurality of sets of read logic circuits, the method further comprising:
the plurality of computing units receive the first feature map in parallel through the plurality of sets of read logic circuits and perform a multi-channel down-sampling operation in parallel;
each computing unit performs a width-direction down-sampling operation on the feature map it reads, and the parallel computing units perform a height-direction down-sampling operation on the width-direction down-sampling results.
14. The method of claim 13, wherein each computing unit of the downsampling module includes a first-in-first-out memory to store source pixels for each downsampling window width in a single channel.
15. The method of claim 13,
wherein the multiple sets of read logic circuits of the data reading module, in response to a down-sampling processing instruction, send a plurality of read requests in parallel to the plurality of storage areas, each read request carrying the number of the corresponding computing unit;
if a plurality of computing units of the down-sampling module need to send read requests to the same storage area, priority arbitration is performed according to the computing-unit numbers, a lower number corresponding to a higher priority; the arbitrated read requests are sequentially sent to the internal cache module, and the read request of a higher-numbered computing unit is sent only after the read request of the lower-numbered computing unit has been responded to.
16. The method of claim 10, wherein the down-sampling operation comprises: maximum downsampling, minimum downsampling, sum downsampling, and mean downsampling of any window size.
17. The method of claim 10, wherein the upsampling operation performed by the upsampling module comprises: nearest neighbor interpolation or bilinear interpolation.
18. The method of claim 10, further comprising:
compatibly implementing, through the up-sampling module, the ROI align algorithm in the MaskRCNN network.
CN202011072023.5A 2020-10-09 2020-10-09 Device and method for scaling and accelerating data of neural network Pending CN114330635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011072023.5A CN114330635A (en) 2020-10-09 2020-10-09 Device and method for scaling and accelerating data of neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011072023.5A CN114330635A (en) 2020-10-09 2020-10-09 Device and method for scaling and accelerating data of neural network

Publications (1)

Publication Number Publication Date
CN114330635A true CN114330635A (en) 2022-04-12

Family

ID=81031719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011072023.5A Pending CN114330635A (en) 2020-10-09 2020-10-09 Device and method for scaling and accelerating data of neural network

Country Status (1)

Country Link
CN (1) CN114330635A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936633A (en) * 2022-06-15 2022-08-23 北京爱芯科技有限公司 Data processing unit for transposition operation and image transposition operation method


Similar Documents

Publication Publication Date Title
Rupnow et al. High level synthesis of stereo matching: Productivity, performance, and software constraints
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
KR100503094B1 (en) DSP having wide memory bandwidth and DSP memory mapping method
CN110390382B (en) Convolutional neural network hardware accelerator with novel feature map caching module
US10001971B2 (en) Electronic apparatus having parallel memory banks
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
CN103760525A (en) Completion type in-place matrix transposition method
JP6680454B2 (en) LSI chip stacking system
US20220113944A1 (en) Arithmetic processing device
CN114330635A (en) Device and method for scaling and accelerating data of neural network
CN111028136A (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
KR20000039714A (en) Texture mapping system
KR20110040103A (en) Apparatus for accessing multi-bank memory
WO2023184754A1 (en) Configurable real-time disparity point cloud computing apparatus and method
CN115829820A (en) Interpolation method, image processing method, GPU and chip
JPH07271744A (en) Parallel computer
US6727905B1 (en) Image data processing apparatus
CN112991141A (en) Frequency domain lucky imaging method based on GPU parallel acceleration
CN113095024A (en) Regional parallel loading device and method for tensor data
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
EP0775973B1 (en) Method and computer program product of transposing data
CN116150055B (en) Data access method and device based on-chip cache and transposition method and device
CN118069099B (en) FPGA-based multi-matrix parallel pipeline transposed SAR imaging method, device, equipment and storage medium
US20230245265A1 (en) Methods and apparatus to warp images for video processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination