WO2021168944A1 - Data caching circuit and method - Google Patents

Data caching circuit and method

Info

Publication number
WO2021168944A1
WO2021168944A1 (PCT/CN2020/080318)
Authority
WO
WIPO (PCT)
Prior art keywords
data
row
register
rows
buffer
Prior art date
Application number
PCT/CN2020/080318
Other languages
English (en)
French (fr)
Inventor
郑琪霖
王绍迪
Original Assignee
杭州知存智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 杭州知存智能科技有限公司
Priority to US16/849,913 priority Critical patent/US11216375B2/en
Publication of WO2021168944A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation of neural networks using electronic means

Definitions

  • The present disclosure relates to data caching, and in particular to data caching for neural network computation.
  • Neural networks are the core of artificial intelligence technology. At present, neural networks receive extensive research and attention and are used in many artificial intelligence applications, including computer vision, speech recognition, robotics, and autonomous driving.
  • In practical applications, the number of layers in a neural network is often very large, sometimes in the thousands, so the volume of the network's input data and intermediate data is also very large. Data caching therefore constitutes a bottleneck for the speed and energy efficiency of neural networks.
  • According to one aspect of the present disclosure, a data caching circuit is configured to cache data of a feature map to be computed by a neural network, wherein the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, and K and S are positive integers. The circuit includes a buffer comprising K cache units, wherein each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
  • According to another aspect, a data caching method stores data of a feature map to be computed by a neural network in a buffer, wherein the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers. The method includes: storing, in each cache unit, multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
  • FIG. 1 is a schematic diagram showing the calculation of a convolutional layer in a convolutional neural network according to an exemplary embodiment;
  • FIGS. 2a and 2b are schematic diagrams showing the window corresponding to the convolution kernel sliding in the feature map according to an exemplary embodiment;
  • FIG. 3 is a structural block diagram showing a system for neural network computation according to an exemplary embodiment;
  • FIG. 4 is a block diagram showing the structure of a data caching circuit according to the first exemplary embodiment of the present disclosure;
  • FIG. 5 is a block diagram showing the structure of a data caching circuit according to the second exemplary embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram showing a buffer according to the second exemplary embodiment of the present disclosure;
  • FIGS. 7a and 7b are schematic diagrams showing the data read mode and the data shift mode of a register group according to the second exemplary embodiment of the present disclosure;
  • FIGS. 8a-8e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides within rows according to the second exemplary embodiment of the present disclosure;
  • FIGS. 9a-9e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides between rows according to the second exemplary embodiment of the present disclosure;
  • FIG. 10 is a flowchart showing a data caching method according to an exemplary embodiment;
  • FIG. 11 is a flowchart showing a data caching method according to an exemplary embodiment;
  • FIG. 12 is a flowchart showing a data caching method according to an exemplary embodiment.
  • The use of the terms first, second, etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another.
  • In some examples, the first element and the second element may refer to the same instance of that element, while in some cases, based on the context, they may also refer to different instances.
  • In practical applications, the neural network used may be a deep neural network (DNN).
  • A deep neural network includes an input layer, several hidden layers (intermediate layers), and an output layer.
  • The input layer receives input data (for example, image pixel data, audio amplitude data, etc.), preprocesses the input data (for example, de-averaging, normalization, principal component analysis (PCA) dimensionality reduction, etc.), and passes the preprocessed data to the hidden layers.
  • Each of the hidden layers receives data from the previous layer, performs computation on the received data, and then passes the computed data to the next layer; a hidden layer may be, for example, a convolutional layer or a pooling layer.
  • The output layer receives data from the last hidden layer, performs computation on the received data, and then outputs the computation result.
  • The output layer may be, for example, a fully connected layer.
  • A convolutional neural network (CNN) is a deep neural network in which the hidden layers include at least one convolutional layer.
  • FIG. 1 is a schematic diagram illustrating the calculation of a convolutional layer in a convolutional neural network according to an exemplary embodiment. As shown in FIG. 1, the feature map 101 is convolved with the convolution kernel 102 to obtain the output matrix 103.
  • The feature map 101 is a three-dimensional matrix of height H, width W, and channel number InCh.
  • The three-dimensional matrix is composed of InCh layers of height H and width W.
  • H, W, and InCh are each positive integers, and H and W may be the same or different.
  • The feature map in FIG. 1 is a three-dimensional matrix of height 5, width 5, and 3 channels.
  • FIG. 1 is only exemplary, and the height, width, and channel number of the feature map are not limited thereto.
  • The feature map is the data input to the convolutional layer by the input layer or the previous hidden layer.
  • For ease of description, each group of data along the width direction of the three-dimensional matrix is called a row of the matrix, and an address along the width direction is called a column address; each group of data along the height direction is called a column of the matrix, and an address along the height direction is called a row address.
  • Alternatively, each group of data along the height direction of the three-dimensional matrix may be referred to as a row, and each group of data along the width direction as a column.
  • Row and column addresses in the three-dimensional matrix start from address "0": the row with row address i is the i-th row, and the column with column address j is the j-th column. A two-dimensional address in the three-dimensional matrix is written as (row address, column address); for example, the data with row address i and column address j has the two-dimensional address (i, j).
  • The convolution kernel 102 is a three-dimensional matrix of height K, width K, and channel number InCh.
  • The channel number of the convolution kernel 102 should be the same as that of the feature map 101.
  • The convolution kernel in FIG. 1 is a three-dimensional matrix of height 3, width 3, and 3 channels.
  • FIG. 1 is only exemplary, and the height, width, and channel number of the convolution kernel are not limited thereto.
  • Although FIG. 1 shows only one convolution kernel, it should be understood that FIG. 1 is only exemplary, and the number of convolution kernels in a convolutional neural network is not limited thereto.
  • The present disclosure uses (height × width) to describe the sizes of the feature map and the convolution kernel; for example, the feature map in FIG. 1 has a size of 5 × 5 data, and the convolution kernel has a size of 3 × 3 data.
  • The window corresponding to the convolution kernel slides along the height or width direction in the feature map with stride S, where S is a positive integer smaller than K. In some embodiments, S may be 1. In other embodiments, S may be greater than one.
  • At each window position, the three-dimensional matrix of the feature-map data covered by the window is convolved with the convolution kernel 102 to obtain one element of the output matrix 103.
  • Convolving the matrix corresponding to the window, i.e., the window-corresponding matrix 101a, with the convolution kernel 102 means: multiplying the window-corresponding matrix 101a element-wise with the elements at corresponding positions of the convolution kernel 102 and then adding all the products to obtain the calculation result 103a in the output matrix 103.
  • K rows of the feature map are selected, and the window slides within these K rows along the row (width) direction.
  • FIG. 2a shows a schematic diagram of the window sliding within the K rows.
  • The window-corresponding matrix is a three-dimensional matrix composed of the data at the window position on all layers of the feature map.
  • After the window has slid to the end of the K rows, the window ends its sliding in the current K rows and starts to slide in reselected K rows.
  • "The window has slid to the end of the K rows" means that if the window were to continue sliding by stride S, it would exceed the range of the feature map. In some cases, the window has slid to the end of the K rows when the last column of the window-corresponding matrix overlaps the last column of the feature map.
  • FIG. 2b shows a schematic diagram of the window sliding between rows. Similar to FIG. 2a, FIG. 2b also shows the two-dimensional plane corresponding to height and width.
  • Although FIGS. 2a and 2b show a window sliding stride of 1, it should be understood that FIG. 2 is only an example, and the stride of window sliding in a convolutional neural network is not limited thereto.
  • FIG. 3 is a structural block diagram showing a system 300 for neural network computation according to an exemplary embodiment.
  • The computation system 300 includes a data caching circuit 301 and a calculation circuit 302.
  • The data caching circuit 301 caches input data for neural network computation and outputs the cached data to the calculation circuit 302.
  • The data caching circuit 301 caches the data of the feature map to be computed by the neural network, and the calculation circuit 302 is loaded with the data of the convolution kernel of the neural network.
  • The data caching circuit 301 outputs the data of the window-corresponding matrices to the calculation circuit 302 in sequence, and the calculation circuit 302 computes on the received window-corresponding matrix and the loaded convolution kernel to obtain each calculation result of the output matrix.
  • Since the data caching circuit 301 caches all the data of the feature map, it is desirable to reduce the storage space occupied by the feature map.
  • The present disclosure reduces the storage space occupied by the feature map by simplifying the cache addressing logic in the data caching circuit 301.
  • FIG. 4 is a block diagram showing the structure of a data caching circuit 400 according to the first exemplary embodiment of the present disclosure. As shown in FIG. 4, the circuit 400 includes a buffer 401 and a cache controller 402.
  • According to some embodiments, the window-corresponding matrices of all window positions are stored separately in the buffer 401, and the cache controller 402 controls the buffer 401 to output the current window-corresponding matrix.
  • The window-corresponding matrix of window position 1 and the window-corresponding matrix of window position 2 in FIG. 2a are both stored in the buffer 401.
  • When the window is at window position 1, the buffer 401 outputs the window-corresponding matrix of window position 1; when the window is at window position 2, the buffer outputs the window-corresponding matrix of window position 2. Since the window-corresponding matrix of window position 1 partially overlaps that of window position 2, the overlapping part is stored twice in the buffer 401. Therefore, although the addressing logic of the buffer 401 is relatively simple in this case, a large amount of feature-map data is stored repeatedly, wasting storage space.
  • According to other embodiments, the three-dimensional matrix corresponding to the feature map is stored in the buffer 401, and the cache controller 402 controls the buffer 401 to output the data corresponding to each two-dimensional address within the current window position in sequence.
  • When the window is at window position 1 of FIG. 2a, the data at addresses (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3) of the feature map are output in sequence.
  • In this case, no data is stored repeatedly in the buffer 401.
  • However, since the data of the window-corresponding matrix is not stored contiguously, the addressing logic of the buffer 401 is relatively complicated.
  • FIG. 5 is a block diagram showing the structure of a data caching circuit 500 according to the second exemplary embodiment of the present disclosure.
  • The circuit 500 is configured to cache data of a feature map to be computed by a neural network, where the convolution kernel of the neural network has a size of K × K data, the window corresponding to the convolution kernel slides in the feature map with stride S, and K and S are positive integers.
  • The circuit 500 includes a buffer 501 comprising K cache units 5010-501(K-1), wherein each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
  • Every K rows of the feature map are K consecutive rows of the feature map, for example, rows 0 to (K-1) of the feature map, or rows 1 to K of the feature map.
  • The window-corresponding matrix consists of K consecutive columns of every K rows of the feature map.
  • Each group of K rows of the feature map is stored across the K different cache units 5010-501(K-1). Therefore, only a column address needs to be provided to the cache units 5010-501(K-1), and the cache units can output the data of one column of the K rows without addressing every datum in that column one by one, which simplifies the addressing logic.
  • According to some embodiments, a row of the feature map is a group of data along the width direction of the feature map, and a column of the feature map is a group of data along the height direction. According to other embodiments, a row of the feature map is a group of data along the height direction, and a column is a group of data along the width direction.
  • For each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.
  • FIG. 6 shows the data stored in the cache units 5010-501(K-1) of the buffer 501 according to some embodiments.
  • For row 0 of the feature map, since the remainder of dividing 0 by K is 0, row 0 of the feature map is stored in cache unit 5010; for row K of the feature map, since the remainder of dividing K by K is also 0, row K of the feature map is stored in cache unit 5010 as well.
  • Rows 1 and (K+1) of the feature map are stored in cache unit 5011, and rows (K-1) and (2K-1) of the feature map are stored in cache unit 501(K-1).
  • The window corresponding to the convolution kernel slides in the width direction within K consecutive rows of the feature map.
  • Suppose the row address of the first of the K consecutive rows is i; then the row addresses of the K consecutive rows are i, i+1, ..., i+(K-1). The remainders obtained by dividing these row addresses by K can therefore be, for example, 0, 1, ..., (K-1) in order, or, for example, q, 1+q, ..., (K-1), 0, 1, ..., (q-1) in order, where q is a positive integer smaller than K and is the remainder of i divided by K. Since the remainder obtained by dividing a row address by K corresponds to the index of the cache unit storing that row, the K consecutive rows of the feature map are stored in the K cache units respectively.
  • The capacity of the cache units is designed according to the size of the feature map.
  • A feature map of height H, width W, and channel number InCh has H rows, and each row contains W data of the feature map, i.e., (W*InCh) values. Let M be the integer quotient of H divided by K; then cache unit 5010 stores M rows, i.e., (M*W*InCh) values of the feature map.
  • The capacity of each cache unit should be designed to be sufficient to store (M*W*InCh) values.
  • The circuit 500 further includes K register groups 5020-502(K-1), each configured to receive data from a corresponding cache unit and output the stored data to the calculation circuit.
  • The register groups 5020-502(K-1) output the data of the current window-corresponding matrix to the calculation circuit, each register group outputting the data of a corresponding row of that matrix: register group 5020 outputs row 0 of the window-corresponding matrix, register group 5021 outputs row 1, ..., and register group 502(K-1) outputs row (K-1).
  • The circuit 500 further includes a cache controller 503 configured to: select K consecutive rows of the feature map; control the buffer 501 to output the data of the matrix corresponding to the window; slide the window along the row direction within the K rows; and, after the window slides, control the buffer 501 to output the last S columns of the matrix corresponding to the window.
  • When the window starts to slide within the selected K rows, the window is located at columns 0 to (K-1) of those K rows, and the cache controller 503 controls the buffer 501 to output the data in columns 0 to (K-1) of the K rows.
  • After each slide of the window, as described above with reference to FIG. 2a, the last (K-S) columns of the pre-slide window-corresponding matrix overlap the first (K-S) columns of the post-slide window-corresponding matrix. Therefore, the cache controller 503 only needs to control the buffer 501 to output the data in the non-overlapping last S columns of the window-corresponding matrix.
  • The cache controller 503 is further configured to: for each cache unit, control the cache unit to output the data of the corresponding row column by column according to the column address.
  • The K consecutive rows of the feature map are stored in the K cache units respectively; therefore, for each cache unit, the cache controller 503 selects the row stored in it.
  • The cache controller 503 controls all the cache units 5010-501(K-1) to output the data of the same column address simultaneously. Since each cache unit outputs one of the selected K rows, when all the cache units 5010-501(K-1) simultaneously output the data of the same column address, the buffer 501 outputs one column of the selected K rows.
  • When the window slides within the selected K rows, the buffer 501 only outputs the data of the non-overlapping last S columns of the pre- and post-slide window-corresponding matrices. Therefore, after each slide, each cache unit continues from its current column address and outputs the last S columns of the window-corresponding matrix without returning to the first column of the matrix. Thus, while the window slides within the selected K rows, each cache unit outputs the data of its selected row column by column according to the column address, never returning to an address before the current column address after a slide.
  • For a buffer, reading the stored data out of order requires complicated addressing logic and slows data reading, whereas outputting the stored data sequentially keeps the addressing logic simple and speeds up data reading. Since the cache units output the data of the selected rows sequentially, the addressing logic is simplified and the data reading speed is improved.
  • After the window has slid to the end of the currently selected K rows, it ends its sliding there and starts to slide in reselected K rows of the feature map, where the last (K-S) rows of the original K rows overlap the first (K-S) rows of the reselected K rows.
  • The cache controller 503 is further configured to: select K consecutive rows starting from the first row address of the feature map; after the window has slid to the end of the K rows, reselect the (S+1)-th through K-th of those K rows together with the S rows following them; and control the buffer to output the reselected K rows starting from the first column address.
  • The cache controller 503 is further configured to: after the window has slid to the end of the K rows, for each cache unit that output the 1st through S-th of the K rows, select the next row stored in that unit, and for each cache unit that output the (S+1)-th through K-th of the K rows, keep the currently selected row; and control each cache unit to output its selected row starting from the first column address of that row.
  • For example, when the window slides within rows 0 to (K-1) of the feature map, the cache units 5010-501(S-1) output the 1st through S-th of the K rows, i.e., rows 0 to (S-1) of the feature map, and the cache units 501S-501(K-1) output the (S+1)-th through K-th of the K rows, i.e., rows S to (K-1) of the feature map.
  • After the window slides to the end of the K rows, cache unit 5010 selects the next row stored in it, i.e., row K of the feature map.
  • Similarly, cache unit 5011 selects row (K+1) of the feature map, ..., and cache unit 501(S-1) selects row (K+S-1) of the feature map.
  • The circuit 500 further includes a multiplexer 504 configured to transfer the data from each cache unit to the corresponding register group.
  • Each register group receives the data of the corresponding row of the window-corresponding matrix and outputs it to the calculation circuit.
  • When the window slides between rows, the selected row of each cache unit corresponds to a different row of the window-corresponding matrix, so the data from the cache unit should be transferred to a different register group.
  • As the window slides between rows of the feature map, the multiplexer 504 changes the correspondence between the cache units outputting data and the register groups receiving data accordingly. For example, when the window slides within rows 0 to (K-1) of the feature map, cache unit 5010 outputs row 0 of the feature map, i.e., row 0 of the window-corresponding matrix; the multiplexer 504 then transfers the data from cache unit 5010 to register group 5020. When the window slides within rows S to (K+S-1) of the feature map, cache unit 5010 outputs row K of the feature map, i.e., row (K-S) of the window-corresponding matrix; the multiplexer 504 then transfers the data from cache unit 5010 to register group 502(K-S).
  • The buffer 501 includes a random access memory (RAM).
  • Each register group 700 includes a write register 701 configured to receive data from the corresponding cache unit, and a calculation register 702 configured to receive data from the write register 701 and output the registered data to the calculation circuit.
  • When the register group 700 is in the data read mode, as shown in FIG. 7a, the write register 701 receives data from the corresponding cache unit, and the calculation register 702 outputs the registered data to the calculation circuit; when the register group 700 is in the data shift mode, as shown in FIG. 7b, the write register 701 shifts the registered data into the calculation register 702.
  • The register group 700 alternates between the data read mode and the data shift mode.
  • In the data read mode, the calculation register 702 outputs the corresponding row of the current window-corresponding matrix to the calculation circuit, and the data registered in the calculation register 702 remains unchanged; the write register 701 receives the non-overlapping part of the corresponding rows of the pre- and post-slide window-corresponding matrices, i.e., the last S columns of the corresponding row of the post-slide window-corresponding matrix.
  • In the data shift mode, the write register 701 shifts the data received in the data read mode into the calculation register 702, and the data in the calculation register 702 is updated to the data of the corresponding row of the post-slide window-corresponding matrix.
  • Since the write register 701 receives data from the cache unit while the calculation register outputs data, and only the non-overlapping part of the corresponding rows of the pre- and post-slide window-corresponding matrices needs to be received, the latency of accessing the buffer is reduced.
  • The calculation register 702 includes K register units 7021-702K, the last of which, register unit 702K, is configured to receive data from the write register 701. In response to receiving data from the write register 701, each of the last (K-1) register units 7022-702K shifts the data registered in it into the previous register unit.
  • When the window starts to slide within K rows of the feature map, the write register 701 shifts the data of the corresponding row of the window-corresponding matrix into the register units 7021-702K in order of column address. In particular, at the first moment, the write register 701 shifts the data of column 0 of the corresponding row into register unit 702K; at the second moment, the write register 701 shifts the data of column 1 into register unit 702K, and register unit 702K shifts the column-0 data into register unit 702(K-1); ...; at the K-th moment, the write register 701 shifts the data of column (K-1) into register unit 702K, and the register units 7022-702K likewise shift their registered data into the previous register unit. At this point, register unit 7021 holds column 0 of the corresponding row, register unit 7022 holds column 1, ..., and register unit 702K holds column (K-1).
  • When the window slides within the K rows, the write register 701 shifts the data of the last S columns of the corresponding row of the post-slide window-corresponding matrix into register units 702(K-S+1)-702K, while the data originally registered in register units 702(K-S+1)-702K is shifted into register units 7021-702(K-S).
  • Whenever the row output by any cache unit changes, the calculation register in each register group is cleared.
  • As the window slides between rows, S cache units change their output rows, and the row output by each register group changes as well. Therefore, when the window starts to slide in the new K rows, the data of the pre-slide output rows in the calculation registers is cleared so that the data of the post-slide output rows can be registered.
  • Each datum of the feature map includes the data having the same two-dimensional address on all channels, where the two-dimensional address includes the row address and column address of the datum.
  • The calculation circuit is a vector-matrix multiplication circuit or an integrated computing-in-memory circuit.
  • FIGS. 8a-8e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides within rows according to the second exemplary embodiment of the present disclosure.
  • The feature map has a size of 5 × 5 data.
  • The convolution kernel has a size of 3 × 3 data.
  • The window sliding stride is 1.
  • Cache unit 8010 outputs row 0 of the feature map.
  • Cache unit 8011 outputs row 1 of the feature map.
  • Cache unit 8012 outputs row 2 of the feature map.
  • Register group 8020 outputs row 0 of the window-corresponding matrix and therefore receives data from cache unit 8010; register group 8021 outputs row 1 of the window-corresponding matrix and therefore receives data from cache unit 8011; register group 8022 outputs row 2 of the window-corresponding matrix and therefore receives data from cache unit 8012.
  • In FIG. 8a, the register groups 8020-8022 are all in the data shift mode.
  • In each register group, the write register holds the data received from the corresponding cache unit in the previous data read mode, i.e., column 0 of the corresponding row of the feature map. The write register now shifts that column 0 of the corresponding feature-map row into the calculation register.
  • In FIG. 8d, the window has slid from window position 1 to window position 2, and window position 2 corresponds to columns 1 to 3 of rows 0 to 2 of the feature map.
  • The register groups 8020-8022 are all in the data read mode.
  • In each register group, the calculation register outputs the corresponding row of the window-corresponding matrix at window position 1, and the write register receives the data of column 3 of the corresponding row of the feature map.
  • In FIG. 8e, the window is at window position 2.
  • The register groups 8020-8022 are all in the data shift mode.
  • In each register group, the write register shifts the data of column 3 of the corresponding feature-map row into the calculation register, and the register units within the calculation register in turn shift their registered data into the previous register unit.
  • After the shift, the data registered in the register groups 8020-8022 is updated to the data of the window-corresponding matrix at post-slide window position 2.
  • FIGS. 9a-9e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides between rows according to the second exemplary embodiment of the present disclosure.
  • At window position 3, the window ends its sliding within rows 0 to 2 of the feature map; at window position 4, the window starts to slide within rows 1 to 3.
  • In FIG. 9a, the window is at window position 3, which corresponds to columns 2 to 4 of rows 0 to 2 of the feature map.
  • Register group 8020 receives data from cache unit 8010; register group 8021 receives data from cache unit 8011; and register group 8022 receives data from cache unit 8012.
  • In FIGS. 9b-9e, the window has slid from window position 3 to window position 4.
  • Window position 4 corresponds to columns 0 to 2 of rows 1 to 3 of the feature map.
  • Cache unit 8010 outputs row 3 of the feature map, cache unit 8011 outputs row 1 of the feature map, and cache unit 8012 outputs row 2 of the feature map.
  • Register group 8020 outputs row 0 of the window-corresponding matrix and therefore receives data from cache unit 8011; register group 8021 outputs row 1 of the window-corresponding matrix and therefore receives data from cache unit 8012; register group 8022 outputs row 2 of the window-corresponding matrix and therefore receives data from cache unit 8010.
  • In FIG. 9b, since the row output by cache unit 8010 changes from row 0 of the feature map to row 3 of the feature map, the data of feature-map row 0 registered in the calculation registers of register groups 8020-8022 is cleared.
  • FIG. 9b shows shifting column 0 of rows 1 to 3 of the feature map into the calculation registers of register groups 8020-8022; FIG. 9c shows shifting column 1 of rows 1 to 3 of the feature map into those calculation registers; FIG. 9d shows shifting column 2 of rows 1 to 3 of the feature map into those calculation registers.
  • In FIG. 9e, the calculation registers in register groups 8020-8022 output the window-corresponding matrix corresponding to window position 4.
  • FIG. 10 is a flowchart showing a data caching method according to an exemplary embodiment of the present disclosure.
  • The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.
  • In each cache unit, multiple rows of the feature map are stored, the multiple rows including a corresponding row in every K rows of the feature map.
  • For each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.
  • FIG. 11 is a flowchart showing a data caching method according to an exemplary embodiment of the present disclosure.
  • The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.
  • In each cache unit, multiple rows of the feature map are stored, the multiple rows including a corresponding row in every K rows of the feature map.
  • At step S1103, each of the K register groups receives data from the corresponding cache unit and outputs the stored data to the calculation circuit.
  • FIG. 12 is a flowchart showing a data caching method according to an exemplary embodiment of the present disclosure.
  • The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.
  • In each cache unit, multiple rows of the feature map are stored, the multiple rows including a corresponding row in every K rows of the feature map.
  • The cache controller selects K consecutive rows of the feature map.
  • The cache controller controls the buffer to output the data of the matrix corresponding to the window.
  • The cache controller slides the window along the row direction within the K rows.
  • After each slide of the window, the cache controller controls the buffer to output the last S columns of the matrix corresponding to the window.
  • Each of the K rows corresponds to a corresponding row in a corresponding one of the K cache units, and for each cache unit, the cache controller controls the cache unit to output the data of the corresponding row column by column according to the column address.
  • The cache controller selects K consecutive rows starting from the first row address of the feature map. After the window has slid to the end of the K rows, the cache controller reselects the (S+1)-th through K-th of those K rows together with the S rows following them; for each cache unit that output the 1st through S-th of the K rows, the cache controller selects the next row stored in that unit, and for each cache unit that output the (S+1)-th through K-th of the K rows, it keeps the currently selected row. The cache controller then controls the buffer to output the reselected K rows starting from the first column address, controlling each cache unit to output its selected row starting from the first column address of that row.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Complex Calculations (AREA)

Abstract

A data caching circuit and method are provided. The circuit is configured to cache data of a feature map to be computed by a neural network, wherein the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, and K and S are positive integers. The circuit comprises a buffer including K cache units, wherein each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.

Description

Data Caching Circuit and Method

Technical Field

The present disclosure relates to data caching, and in particular to data caching for neural network computation.

Background

Neural networks are the core of artificial intelligence technology. At present, neural networks receive extensive research and attention and are applied in many artificial intelligence fields, including computer vision, speech recognition, robotics, and autonomous driving.

In practical applications, the number of layers in a neural network is often very large, sometimes in the thousands, so the volume of the network's input data and intermediate data is also very large. Data caching therefore constitutes a bottleneck for the speed and energy efficiency of neural networks.

The methods described in this section are not necessarily methods that have been previously conceived or adopted. Unless otherwise indicated, it should not be assumed that any method described in this section is considered prior art merely because it is included in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered to have been recognized in any prior art.

Summary

According to one aspect of the present disclosure, a data caching circuit is provided. The circuit is configured to cache data of a feature map to be computed by a neural network, wherein the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, and K and S are positive integers. The circuit includes a buffer comprising K cache units, wherein each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.

According to another aspect of the present disclosure, a data caching method is provided. The method stores data of a feature map to be computed by a neural network in a buffer, wherein the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers. The method includes: storing, in each cache unit, multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
Brief Description of the Drawings

The accompanying drawings illustrate embodiments by way of example, constitute a part of the specification, and together with the written description serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustration only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.

FIG. 1 is a schematic diagram showing the calculation of a convolutional layer in a convolutional neural network according to an exemplary embodiment;
FIGS. 2a and 2b are schematic diagrams showing the window corresponding to the convolution kernel sliding in the feature map according to an exemplary embodiment;
FIG. 3 is a structural block diagram of a system for neural network computation according to an exemplary embodiment;
FIG. 4 is a structural block diagram of a data caching circuit according to a first exemplary embodiment of the present disclosure;
FIG. 5 is a structural block diagram of a data caching circuit according to a second exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a buffer according to the second exemplary embodiment of the present disclosure;
FIGS. 7a and 7b are schematic diagrams showing the data read mode and the data shift mode of a register group according to the second exemplary embodiment of the present disclosure;
FIGS. 8a-8e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides within rows according to the second exemplary embodiment of the present disclosure;
FIGS. 9a-9e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides between rows according to the second exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of a data caching method according to an exemplary embodiment;
FIG. 11 is a flowchart of a data caching method according to an exemplary embodiment;
FIG. 12 is a flowchart of a data caching method according to an exemplary embodiment.
Detailed Description

In the present disclosure, unless otherwise stated, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various examples in the present disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, there may be one or more of that element. Furthermore, the term "and/or" as used in the present disclosure covers any one and all possible combinations of the listed items.

In practical applications, the neural network used may be a deep neural network (DNN). A deep neural network includes an input layer, several hidden layers (intermediate layers), and an output layer. The input layer receives input data (for example, pixel data of images, amplitude data of audio, etc.), preprocesses the input data (for example, de-averaging, normalization, principal component analysis (PCA) dimensionality reduction, etc.), and passes the preprocessed data to the hidden layers. Each of the hidden layers receives data from the previous layer, performs computation on the received data, and then passes the computed data to the next layer; a hidden layer may be, for example, a convolutional layer or a pooling layer. The output layer receives data from the last hidden layer, performs computation on the received data, and then outputs the computation result; the output layer may be, for example, a fully connected layer. A convolutional neural network (CNN) is a deep neural network in which the hidden layers include at least one convolutional layer.
FIG. 1 is a schematic diagram showing the calculation of a convolutional layer in a convolutional neural network according to an exemplary embodiment. As shown in FIG. 1, the feature map 101 is convolved with the convolution kernel 102 to obtain the output matrix 103.

According to some embodiments, the feature map 101 is a three-dimensional matrix of height H, width W, and channel number InCh, composed of InCh layers of height H and width W. H, W, and InCh are each positive integers, and H and W may be the same or different. For example, the feature map in FIG. 1 is a three-dimensional matrix of height 5, width 5, and 3 channels. It should be understood, however, that FIG. 1 is only exemplary, and the height, width, and channel number of the feature map are not limited thereto. According to some embodiments, the feature map is the data input to the convolutional layer by the input layer or the previous hidden layer.

For ease of description, each group of data along the width direction of the three-dimensional matrix is called a row of the matrix, and an address along the width direction is called a column address; each group of data along the height direction is called a column of the matrix, and an address along the height direction is called a row address. It should be understood, however, that each group of data along the height direction could equally be called a row of the matrix, and each group of data along the width direction a column.

For ease of description, row and column addresses in the three-dimensional matrix are defined to start from address "0"; the row with row address i is the i-th row, and the column with column address j is the j-th column. A two-dimensional address in the three-dimensional matrix is written as (row address, column address); for example, the data with row address i and column address j has the two-dimensional address (i, j).

According to some embodiments, the convolution kernel 102 is a three-dimensional matrix of height K, width K, and channel number InCh; for the convolution computation, the channel number of the convolution kernel 102 should be the same as that of the feature map 101. For example, the convolution kernel in FIG. 1 is a three-dimensional matrix of height 3, width 3, and 3 channels. It should be understood, however, that FIG. 1 is only exemplary, and the height, width, and channel number of the convolution kernel are not limited thereto. In addition, although the example in FIG. 1 shows only one convolution kernel, it should be understood that FIG. 1 is only exemplary, and the number of convolution kernels in a convolutional neural network is not limited thereto.

For ease of description, the present disclosure uses (height × width) to describe the sizes of the feature map and the convolution kernel; for example, the feature map in FIG. 1 has a size of 5 × 5 data, and the convolution kernel has a size of 3 × 3 data.
The feature map 101 is convolved with the convolution kernel 102 to obtain the output matrix 103; FIG. 1 denotes this convolution computation with a convolution symbol. In particular, the window corresponding to the convolution kernel slides in the feature map along the height or width direction with stride S, where S is a positive integer smaller than K. In some embodiments, S may be 1; in other embodiments, S may be greater than 1. At each position the window reaches, the three-dimensional matrix of the feature-map data covered by the window is convolved with the convolution kernel 102 to obtain each element of the output matrix 103. Convolving the matrix corresponding to the window, i.e., the window-corresponding matrix 101a, with the convolution kernel 102 means: multiplying the window-corresponding matrix 101a element-wise with the elements at corresponding positions of the convolution kernel 102, then adding all the products to obtain the calculation result 103a in the output matrix 103.
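For illustration only (this sketch is the editor's, not part of the patent text; the names feature_map and kernel and the use of NumPy are assumptions), the element-wise multiply-and-sum at each window position can be written as:

```python
import numpy as np

# Illustrative shapes: feature map (H, W, InCh), kernel (K, K, InCh).
H, W, InCh, K, S = 5, 5, 3, 3, 1
feature_map = np.arange(H * W * InCh, dtype=np.int64).reshape(H, W, InCh)
kernel = np.ones((K, K, InCh), dtype=np.int64)

def conv_at(row, col):
    """Convolve the window whose top-left corner is (row, col):
    multiply corresponding elements and sum all products (101a with 102 -> 103a)."""
    window = feature_map[row:row + K, col:col + K, :]  # window-corresponding matrix
    return int(np.sum(window * kernel))

# Output matrix 103: one element per window position.
out = np.array([[conv_at(i, j)
                 for j in range(0, W - K + 1, S)]
                for i in range(0, H - K + 1, S)])
print(out.shape)  # (3, 3) for the 5x5 / 3x3 / stride-1 example of FIG. 1
```

For the 5 × 5 feature map and 3 × 3 kernel of FIG. 1 with stride 1, this yields a 3 × 3 output matrix.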
According to some embodiments, K rows of the feature map are selected, and the window slides within these K rows along the row (width) direction. FIG. 2a shows a schematic diagram of the window sliding within the K rows. For ease of description, FIG. 2a only draws the two-dimensional plane corresponding to height and width, but it should be understood that the window-corresponding matrix is a three-dimensional matrix composed of the data at the window position on all layers of the feature map. As shown in FIG. 2a, after the window-corresponding matrix at window position 1 is convolved with the convolution kernel, the window slides along the width direction by stride S (in this example, S=1) to window position 2, and the window-corresponding matrix at window position 2 is then convolved with the convolution kernel. The last (K-S) columns of the window-corresponding matrix at window position 1 overlap the first (K-S) columns of the window-corresponding matrix at window position 2.

According to other embodiments, after the window has slid to the end of the K rows, the window ends its sliding in the current K rows and starts to slide in reselected K rows. In the present disclosure, "the window has slid to the end of the K rows" means that if the window were to continue sliding by stride S, it would exceed the range of the feature map. In some cases, the window has slid to the end of the K rows when it has slid such that the last column of the window-corresponding matrix overlaps the last column of the feature map. In other cases, even though the last column of the window-corresponding matrix does not yet overlap the last column of the feature map, the difference between the column address of the last column of the feature map and that of the last column of the window-corresponding matrix is already smaller than S; the window has then likewise slid to the end of the K rows, because continuing to slide by stride S would exceed the range of the feature map. "After the window has slid to the end of the K rows" means that the sliding of the window to the end of the K rows and the corresponding data output operations have been completed. FIG. 2b shows a schematic diagram of the window sliding between rows. Like FIG. 2a, FIG. 2b shows the two-dimensional plane corresponding to height and width. As shown in FIG. 2b, after the window has slid to the end of the K rows (window position 3, corresponding to the last K columns of the current K rows), the sliding in the current K rows ends, and the last (K-S) rows of the current K rows together with the S rows below the current K rows are selected as the new K rows. The window moves from the original window position 3 to window position 4, corresponding to the first K columns of the new K rows, and starts to slide in the new K rows.

Although the examples in FIGS. 2a and 2b show a window sliding stride of 1, it should be understood that FIG. 2 is only exemplary, and the stride of window sliding in a convolutional neural network is not limited thereto.
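The sliding order just described can be expressed as a short sketch (illustrative Python; the helper name window_positions is the editor's, and the end-of-rows test mirrors the rule stated above):

```python
def window_positions(H, W, K, S):
    """Yield (row, col) of the window's top-left corner in sliding order.
    Sliding within the selected K rows ends when one more slide of S
    would exceed the feature map ('end of the K rows')."""
    row = 0
    while row + K <= H:
        col = 0
        while col + K <= W:
            yield (row, col)
            if col + K + S > W:   # one more slide would leave the feature map
                break
            col += S
        if row + K + S > H:       # likewise for sliding between rows
            break
        row += S

# For the 5x5 feature map, 3x3 kernel, S=1 example:
print(list(window_positions(5, 5, 3, 1)))
# [(0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]
```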
FIG. 3 is a structural block diagram of a system 300 for neural network computation according to an exemplary embodiment. As shown in FIG. 3, the computation system 300 includes a data caching circuit 301 and a calculation circuit 302. The data caching circuit 301 caches input data for neural network computation and outputs the cached data to the calculation circuit 302.

According to some embodiments, the data caching circuit 301 caches the data of the feature map to be computed by the neural network, and the calculation circuit 302 is loaded with the data of the convolution kernel of the neural network. According to some embodiments, the data caching circuit 301 outputs the data of the window-corresponding matrices to the calculation circuit 302 in sequence, and the calculation circuit 302 computes on the received window-corresponding matrix and the loaded convolution kernel to obtain each calculation result of the output matrix.

Since the data caching circuit 301 caches all the data of the feature map, it is desirable to reduce the storage space occupied by the feature map. The present disclosure reduces the storage space occupied by the feature map by simplifying the cache addressing logic in the data caching circuit 301.

FIG. 4 is a structural block diagram of a data caching circuit 400 according to a first exemplary embodiment of the present disclosure. As shown in FIG. 4, the circuit 400 includes a buffer 401 and a cache controller 402.

According to some embodiments, the window-corresponding matrices of all window positions are stored separately in the buffer 401, and the cache controller 402 controls the buffer 401 to output the current window-corresponding matrix. For example, the window-corresponding matrix of window position 1 and the window-corresponding matrix of window position 2 in FIG. 2a are both stored in the buffer 401. When the window is at window position 1, the buffer 401 outputs the window-corresponding matrix of window position 1; when the window is at window position 2, the buffer outputs the window-corresponding matrix of window position 2. Since the window-corresponding matrix of window position 1 partially overlaps that of window position 2, the overlapping part is stored twice in the buffer 401. Therefore, although the addressing logic of the buffer 401 is relatively simple in this case, a large amount of feature-map data is stored repeatedly, wasting storage space.

According to other embodiments, the three-dimensional matrix corresponding to the feature map is stored in the buffer 401, and the cache controller 402 controls the buffer 401 to output the data corresponding to each two-dimensional address within the current window position in sequence. For example, when the window is at window position 1 of FIG. 2a, the data at addresses (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3) of the feature map are output in sequence. In this case, no data is stored repeatedly in the buffer 401. However, since the data of the window-corresponding matrix is not stored contiguously in the buffer 401, the addressing logic of the buffer 401 is relatively complicated.
FIG. 5 is a structural block diagram of a data caching circuit 500 according to the second exemplary embodiment of the present disclosure.

According to embodiments of the present disclosure, the circuit 500 is configured to cache data of a feature map to be computed by a neural network, where the convolution kernel of the neural network has a size of K × K data, the window corresponding to the convolution kernel slides in the feature map with stride S, and K and S are positive integers. The circuit 500 includes a buffer 501 comprising K cache units 5010-501(K-1), where each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map. Here, every K rows of the feature map are K consecutive rows of the feature map, for example, rows 0 to (K-1) of the feature map, or rows 1 to K of the feature map.

As shown in FIG. 2a, when the window corresponding to the convolution kernel slides in the width direction within every K rows of the feature map, the window-corresponding matrix consists of K consecutive columns of those K rows. In the present disclosure, since each cache unit stores a corresponding row in every K rows, each group of K rows of the feature map is stored across the K different cache units 5010-501(K-1). Therefore, only a column address needs to be provided to the cache units 5010-501(K-1), and the cache units can output the data of one column of the K rows without addressing every datum in that column one by one, which simplifies the addressing logic.

According to some embodiments, a row of the feature map is a group of data along the width direction of the feature map, and a column of the feature map is a group of data along the height direction. According to other embodiments, a row of the feature map is a group of data along the height direction, and a column is a group of data along the width direction.

According to embodiments of the present disclosure, for each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.

FIG. 6 shows the data stored in the cache units 5010-501(K-1) of the buffer 501 according to some embodiments. As shown in FIG. 6, for row 0 of the feature map, since the remainder of dividing 0 by K is 0, row 0 of the feature map is stored in cache unit 5010; for row K of the feature map, since the remainder of dividing K by K is also 0, row K of the feature map is stored in cache unit 5010 as well. Similarly, rows 1 and (K+1) of the feature map are stored in cache unit 5011, and rows (K-1) and (2K-1) of the feature map are stored in cache unit 501(K-1).
As described with reference to FIG. 2a, the window corresponding to the convolution kernel slides in the width direction within K consecutive rows of the feature map. Suppose the row address of the first of the K consecutive rows is i; then the row addresses of the K consecutive rows are i, i+1, ..., i+(K-1). The remainders obtained by dividing these row addresses by K can therefore be, for example, 0, 1, ..., (K-1) in order, or, for example, q, 1+q, ..., (K-1), 0, 1, ..., (q-1) in order, where q is a positive integer smaller than K and is the remainder of i divided by K. Since the remainder obtained by dividing a row address by K corresponds to the index of the cache unit storing that row, the K consecutive rows of the feature map are stored in the K cache units respectively.
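A minimal sketch of this mapping (illustrative; unit_of_row is a hypothetical helper name, not one from the patent):

```python
K = 3  # kernel height/width

def unit_of_row(row_address: int) -> int:
    """A feature-map row is stored in the cache unit whose index equals
    the remainder of the row address divided by K."""
    return row_address % K

# Rows 0..5 of a feature map with K = 3:
print({r: unit_of_row(r) for r in range(6)})
# {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2}
# Any K consecutive rows, e.g. rows 2, 3, 4, map to K distinct units:
assert sorted(unit_of_row(r) for r in (2, 3, 4)) == [0, 1, 2]
```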
According to some embodiments, the capacity of the cache units is designed according to the size of the feature map. A feature map of height H, width W, and channel number InCh has H rows, and each row contains W data of the feature map, i.e., (W*InCh) values. Let M be the integer quotient of H divided by K; then cache unit 5010 stores M rows, i.e., (M*W*InCh) values of the feature map. The capacity of each cache unit should be designed to be sufficient to store (M*W*InCh) values.

According to embodiments of the present disclosure, the circuit 500 further includes K register groups 5020-502(K-1), each configured to receive data from a corresponding cache unit and output the stored data to the calculation circuit.

According to some embodiments, the register groups 5020-502(K-1) output the data of the current window-corresponding matrix to the calculation circuit, each register group outputting the data of a corresponding row of that matrix. In particular, register group 5020 outputs row 0 of the window-corresponding matrix, register group 5021 outputs row 1, ..., and register group 502(K-1) outputs row (K-1).

According to embodiments of the present disclosure, the circuit 500 further includes a cache controller 503 configured to: select K consecutive rows of the feature map; control the buffer 501 to output the data of the matrix corresponding to the window; slide the window along the row direction within the K rows; and, after the window slides, control the buffer 501 to output the last S columns of the matrix corresponding to the window.

According to some embodiments, when the window starts to slide within the selected K rows, the window is located at columns 0 to (K-1) of those K rows, and the cache controller 503 controls the buffer 501 to output the data in columns 0 to (K-1) of the K rows. According to other embodiments, after each slide of the window, as described above with reference to FIG. 2a, the last (K-S) columns of the pre-slide window-corresponding matrix overlap the first (K-S) columns of the post-slide window-corresponding matrix; therefore, the cache controller 503 only needs to control the buffer 501 to output the data of the non-overlapping last S columns of the window-corresponding matrix.

According to embodiments of the present disclosure, each of the K rows corresponds to a corresponding row stored in a corresponding one of the K cache units, and the cache controller 503 is further configured to: for each cache unit, control the cache unit to output the data of the corresponding row column by column according to the column address. As described above, the K consecutive rows of the feature map are stored in the K cache units respectively; therefore, for each cache unit, the cache controller 503 selects the row stored in it.

According to some embodiments, the cache controller 503 controls all the cache units 5010-501(K-1) to output the data of the same column address simultaneously. Since each cache unit outputs one of the selected K rows, when all the cache units 5010-501(K-1) simultaneously output the data of the same column address, the buffer 501 outputs one column of the selected K rows.

When the window slides within the selected K rows, the buffer 501 only outputs the data of the non-overlapping last S columns of the pre- and post-slide window-corresponding matrices. Therefore, after each slide of the window, each cache unit continues from its current column address and outputs the last S columns of the window-corresponding matrix without returning to the first column of the matrix. Thus, while the window slides within the selected K rows, each cache unit outputs the data of its selected row column by column according to the column address, without returning to an address before the current column address after each slide.
For a buffer, reading the stored data out of order requires complicated addressing logic and slows data reading, whereas outputting the stored data sequentially keeps the addressing logic simple and speeds up data reading. Since the cache units output the data of the selected rows sequentially, the addressing logic is simplified and the data reading speed is improved.
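The resulting column-address sequence is strictly increasing, which the following sketch illustrates (editor's illustration; column_schedule is a hypothetical helper, and zero-based column addresses are assumed):

```python
def column_schedule(W, K, S):
    """Column addresses issued to every cache unit while the window slides
    within one group of K rows: first the K columns of the initial window,
    then only the S new (non-overlapping) columns after each slide."""
    cols = list(range(K))                          # columns 0 .. K-1 of window position 1
    col = 0
    while col + K + S <= W:                        # while another slide of S fits
        col += S
        cols.extend(range(col + K - S, col + K))   # the last S columns only
    return cols

print(column_schedule(W=5, K=3, S=1))
# [0, 1, 2, 3, 4] -- each column address is read exactly once, in order
```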
According to some embodiments, as described above with reference to FIG. 2b, after the window slides to the end of the currently selected K rows of the feature map, the window ends its sliding in the currently selected K rows and starts to slide in reselected K rows of the feature map, where the last (K-S) rows of the original K rows overlap the first (K-S) rows of the reselected K rows.

According to embodiments of the present disclosure, the cache controller 503 is further configured to: select K consecutive rows starting from the first row address of the feature map; after the window has slid to the end of the K rows, reselect the (S+1)-th through K-th of those K rows together with the S rows following them; and control the buffer to output the reselected K rows starting from the first column address.

According to embodiments of the present disclosure, the cache controller 503 is further configured to: after the window has slid to the end of the K rows, for each cache unit that output the 1st through S-th of the K rows, select the next row stored in that unit, and for each cache unit that output the (S+1)-th through K-th of the K rows, keep the currently selected row; and control each cache unit to output its selected row starting from the first column address of that row.

For example, when the window slides within rows 0 to (K-1) of the feature map, the cache units 5010-501(S-1) output the 1st through S-th of the K rows, i.e., rows 0 to (S-1) of the feature map, and the cache units 501S-501(K-1) output the (S+1)-th through K-th of the K rows, i.e., rows S to (K-1) of the feature map. After the window has slid to the end of the K rows, cache unit 5010 selects the next row stored in it, i.e., row K of the feature map; similarly, cache unit 5011 selects row (K+1) of the feature map, ..., and cache unit 501(S-1) selects row (K+S-1) of the feature map.

According to embodiments of the present disclosure, the circuit 500 further includes a multiplexer 504 configured to transfer the data from each cache unit to the corresponding register group. As described above, each register group receives the data of the corresponding row of the window-corresponding matrix and outputs it to the calculation circuit. However, when the window slides between rows, the selected row of each cache unit corresponds to a different row of the window-corresponding matrix, so the data from the cache unit should be transferred to a different register group.

According to some embodiments, as the window slides between rows of the feature map, the multiplexer 504 changes the correspondence between the cache units outputting data and the register groups receiving data accordingly. For example, when the window slides within rows 0 to (K-1) of the feature map, cache unit 5010 outputs row 0 of the feature map, i.e., row 0 of the window-corresponding matrix; the multiplexer 504 then transfers the data from cache unit 5010 to register group 5020. When the window slides within rows S to (K+S-1) of the feature map, cache unit 5010 outputs row K of the feature map, i.e., row (K-S) of the window-corresponding matrix; the multiplexer 504 then transfers the data from cache unit 5010 to register group 502(K-S).
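A sketch of this remapping (illustrative; base_row, meaning the first row address of the currently selected K rows, is the editor's parameter name):

```python
K = 3

def register_group_for_unit(unit: int, base_row: int) -> int:
    """Cache unit `unit` holds the feature-map row r with r % K == unit and
    base_row <= r < base_row + K; its data goes to the register group that
    outputs row (r - base_row) of the window-corresponding matrix."""
    return (unit - base_row) % K

# Window in rows 0-2: identity mapping (unit 0 -> group 0, ...).
print([register_group_for_unit(u, base_row=0) for u in range(K)])  # [0, 1, 2]
# Window in rows 1-3: unit 0 now holds row 3, i.e. row (K-S) of the matrix.
print([register_group_for_unit(u, base_row=1) for u in range(K)])  # [2, 0, 1]
```

The second print matches the FIG. 9b example below: cache unit 8010 (row 3) feeds register group 8022.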
According to embodiments of the present disclosure, the buffer 501 includes a random access memory (RAM).

FIGS. 7a-7b are schematic diagrams showing the data read mode and the data shift mode of a register group according to the second exemplary embodiment of the present disclosure. According to embodiments of the present disclosure, each register group 700 includes a write register 701 configured to receive data from the corresponding cache unit, and a calculation register 702 configured to receive data from the write register 701 and output the registered data to the calculation circuit.

According to embodiments of the present disclosure, when the register group 700 is in the data read mode, as shown in FIG. 7a, the write register 701 receives data from the corresponding cache unit, and the calculation register 702 outputs the registered data to the calculation circuit; when the register group 700 is in the data shift mode, as shown in FIG. 7b, the write register 701 shifts the registered data into the calculation register 702.

According to some embodiments, the register group 700 alternates between the data read mode and the data shift mode. When the register group 700 is in the data read mode, the calculation register 702 outputs the corresponding row of the current window-corresponding matrix to the calculation circuit, and the data registered in the calculation register 702 remains unchanged; the write register 701 receives the non-overlapping part of the corresponding rows of the pre- and post-slide window-corresponding matrices, i.e., the last S columns of the corresponding row of the post-slide window-corresponding matrix. When the register group 700 is in the data shift mode, the write register 701 shifts the data received in the data read mode into the calculation register 702, and the data in the calculation register 702 is updated to the data of the corresponding row of the post-slide window-corresponding matrix. Since the write register 701 receives data from the cache unit while the calculation register 702 outputs the current window-corresponding matrix to the calculation circuit, and only the data of the non-overlapping part of the corresponding rows needs to be received, the latency of accessing the buffer is reduced.

According to embodiments of the present disclosure, the calculation register 702 includes K register units 7021-702K, the last of which, register unit 702K, is configured to receive data from the write register 701. In response to receiving data from the write register 701, each of the last (K-1) register units 7022-702K shifts the data registered in it into the previous register unit.

According to some embodiments, when the window starts to slide within K rows of the feature map, the write register 701 shifts the data of the corresponding row of the window-corresponding matrix into the register units 7021-702K in order of column address. In particular, at the first moment, the write register 701 shifts the data of column 0 of the corresponding row into register unit 702K; at the second moment, the write register 701 shifts the data of column 1 of the corresponding row into register unit 702K, and register unit 702K shifts the column-0 data of the corresponding row into register unit 702(K-1); ...; at the K-th moment, the write register 701 shifts the data of column (K-1) of the corresponding row into register unit 702K, and the register units 7022-702K likewise shift the data registered in them into the previous register unit. At this point, register unit 7021 holds column 0 of the corresponding row, register unit 7022 holds column 1, ..., and register unit 702K holds column (K-1).

According to other embodiments, when the window slides within the K rows of the feature map, the write register 701 shifts the data of the last S columns of the corresponding row of the post-slide window-corresponding matrix into register units 702(K-S+1)-702K in order of column address, while the data originally registered in register units 702(K-S+1)-702K is shifted into register units 7021-702(K-S).
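The read/shift alternation can be mimicked in software as follows (a list-based illustration of hardware registers, with names chosen by the editor; the initial fill is batched here, whereas the text above describes it column by column):

```python
from collections import deque

class RegisterGroup:
    """Sketch of one register group: a write register plus a K-unit
    calculation register organized as a shift register (7021-702K)."""
    def __init__(self, K):
        self.write_reg = deque()          # data read from the cache unit
        self.calc_reg = deque(maxlen=K)   # oldest column drops out at unit 7021

    def read_mode(self, new_columns):
        # The write register receives only the non-overlapping last S columns;
        # the calculation register keeps outputting the current row unchanged.
        self.write_reg.extend(new_columns)
        return list(self.calc_reg)

    def shift_mode(self):
        # The write register shifts its data into the calculation register;
        # older columns move toward unit 7021 and eventually drop out.
        while self.write_reg:
            self.calc_reg.append(self.write_reg.popleft())

row = [10, 11, 12, 13, 14]        # one feature-map row, W = 5
g = RegisterGroup(K=3)
g.read_mode(row[0:3]); g.shift_mode()
print(list(g.calc_reg))            # [10, 11, 12] -> window position 1
g.read_mode(row[3:4]); g.shift_mode()
print(list(g.calc_reg))            # [11, 12, 13] -> window position 2
```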
According to embodiments of the present disclosure, whenever the row of the feature map output by any cache unit changes, the calculation register in each register group is cleared. As described above, as the window slides between rows of the feature map, S cache units change their output rows, and the row output by each register group changes as well. Therefore, when the window starts to slide in the new K rows, the data of the pre-slide output rows in the calculation registers is cleared so that the data of the post-slide output rows can be registered.

According to embodiments of the present disclosure, each datum of the feature map includes the data having the same two-dimensional address on all channels, where the two-dimensional address includes the row address and column address of the datum.

According to embodiments of the present disclosure, the calculation circuit is a vector-matrix multiplication circuit or an integrated computing-in-memory circuit.

FIGS. 8a-8e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides within rows according to the second exemplary embodiment of the present disclosure. Here, the feature map has a size of 5 × 5 data, the convolution kernel has a size of 3 × 3 data, and the window sliding stride is 1.

As shown in FIGS. 8a-8e, the window slides within rows 0 to 2 of the feature map; cache unit 8010 outputs row 0 of the feature map, cache unit 8011 outputs row 1, and cache unit 8012 outputs row 2. Register group 8020 outputs row 0 of the window-corresponding matrix and therefore receives data from cache unit 8010; register group 8021 outputs row 1 of the window-corresponding matrix and therefore receives data from cache unit 8011; register group 8022 outputs row 2 of the window-corresponding matrix and therefore receives data from cache unit 8012.

As shown in FIGS. 8a-8c, when the window starts to slide within rows 0 to 2 of the feature map, the window is at window position 1, which corresponds to columns 0 to 2 of rows 0 to 2 of the feature map. As shown in FIG. 8a, the register groups 8020-8022 are all in the data shift mode. In each register group, the write register holds the data received from the corresponding cache unit in the previous data read mode, i.e., column 0 of the corresponding row of the feature map; it now shifts that column into the calculation register. Similarly, in FIG. 8b, the write register in each register group shifts column 1 of the corresponding feature-map row into the calculation register, which then holds columns 0 and 1 of that row; in FIG. 8c, the write register shifts column 2 of the corresponding feature-map row into the calculation register, which then holds columns 0 to 2 of that row.

As shown in FIG. 8d, the window has slid from window position 1 to window position 2, which corresponds to columns 1 to 3 of rows 0 to 2 of the feature map. The register groups 8020-8022 are now all in the data read mode. In each register group, the calculation register outputs the corresponding row of the window-corresponding matrix at window position 1, and the write register receives the data of column 3 of the corresponding row of the feature map.

As shown in FIG. 8e, the window is at window position 2, and the register groups 8020-8022 are all in the data shift mode. In each register group, the write register shifts the data of column 3 of the corresponding feature-map row into the calculation register, and the register units within the calculation register in turn shift the data registered in them into the previous register unit. After this data shift, the data registered in the register groups 8020-8022 is updated to the data of the window-corresponding matrix at post-slide window position 2.
FIGS. 9a-9e are schematic diagrams showing example operations of the data caching circuit when the convolution kernel of the neural network slides between rows according to the second exemplary embodiment of the present disclosure.

As shown in FIGS. 9a-9e, at window position 3 the window ends its sliding within rows 0 to 2 of the feature map; at window position 4, the window starts to slide within rows 1 to 3.

As shown in FIG. 9a, the window is at window position 3, which corresponds to columns 2 to 4 of rows 0 to 2 of the feature map. At this point, as described with reference to FIGS. 8a-8c, register group 8020 receives data from cache unit 8010, register group 8021 receives data from cache unit 8011, and register group 8022 receives data from cache unit 8012.

As shown in FIGS. 9b-9e, the window has slid from window position 3 to window position 4, which corresponds to columns 0 to 2 of rows 1 to 3 of the feature map. Cache unit 8010 now outputs row 3 of the feature map, cache unit 8011 outputs row 1, and cache unit 8012 outputs row 2. Register group 8020 outputs row 0 of the window-corresponding matrix and therefore receives data from cache unit 8011; register group 8021 outputs row 1 of the window-corresponding matrix and therefore receives data from cache unit 8012; register group 8022 outputs row 2 of the window-corresponding matrix and therefore receives data from cache unit 8010. In FIG. 9b, since the row output by cache unit 8010 changes from row 0 of the feature map to row 3 of the feature map, the data of feature-map row 0 registered in the calculation registers of register groups 8020-8022 is cleared.

In FIGS. 9b-9d, as described with reference to FIGS. 8a-8c, when the window starts to slide within rows 1 to 3 of the feature map, the cache units 8010-8012 output the window-corresponding matrix at window position 4 column by column to the register groups 8020-8022. In particular, FIG. 9b shows shifting column 0 of rows 1 to 3 of the feature map into the calculation registers of register groups 8020-8022; FIG. 9c shows shifting column 1 of rows 1 to 3 of the feature map into those calculation registers; and FIG. 9d shows shifting column 2 of rows 1 to 3 of the feature map into those calculation registers.

In FIG. 9e, as described with reference to FIG. 8d, the calculation registers in register groups 8020-8022 output the window-corresponding matrix corresponding to window position 4.
FIG. 10 is a flowchart of a data caching method according to an exemplary embodiment of the present disclosure. The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.

At step S1001, multiple rows of the feature map are stored in each cache unit, the multiple rows including a corresponding row in every K rows of the feature map. According to some embodiments, for each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.

FIG. 11 is a flowchart of a data caching method according to an exemplary embodiment of the present disclosure. The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.

At step S1101, multiple rows of the feature map are stored in each cache unit, the multiple rows including a corresponding row in every K rows of the feature map.

At step S1103, each of the K register groups receives data from the corresponding cache unit and outputs the stored data to the calculation circuit.

FIG. 12 is a flowchart of a data caching method according to an exemplary embodiment of the present disclosure. The method stores data of a feature map to be computed by a neural network in a buffer, where the convolution kernel of the neural network has a size of K*K data, the window corresponding to the convolution kernel slides in the feature map with stride S, the buffer includes K cache units, and K and S are positive integers.

At step S1201, multiple rows of the feature map are stored in each cache unit, the multiple rows including a corresponding row in every K rows of the feature map.

At step S1203, the cache controller selects K consecutive rows of the feature map.

At step S1205, the cache controller controls the buffer to output the data of the matrix corresponding to the window.

At step S1207, the cache controller slides the window along the row direction within the K rows; and

At step S1209, after each slide of the window, the cache controller controls the buffer to output the last S columns of the matrix corresponding to the window. According to some embodiments, each of the K rows corresponds to a corresponding row in a corresponding one of the K cache units, and for each cache unit, the cache controller controls the cache unit to output the data of the corresponding row column by column according to the column address.

According to some embodiments, the cache controller selects K consecutive rows starting from the first row address of the feature map. After the window has slid to the end of the K rows, the cache controller reselects the (S+1)-th through K-th of those K rows together with the S rows following them; for each cache unit that output the 1st through S-th of the K rows, the cache controller selects the next row stored in that unit, and for each cache unit that output the (S+1)-th through K-th of the K rows, it keeps the currently selected row. The cache controller then controls the buffer to output the reselected K rows starting from the first column address, controlling each cache unit to output its selected row starting from the first column address of that row.
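Tying the steps together, the following end-to-end sketch (purely illustrative, for the stride-1 case and a single group of K rows; all names are the editor's) checks that reading only the last S columns per slide reproduces each window-corresponding matrix:

```python
import numpy as np

H, W, K, S = 5, 5, 3, 1
fmap = np.arange(H * W).reshape(H, W)            # one channel, for brevity

# Step S1201: row r of the feature map is stored in cache unit r % K.
units = {u: [r for r in range(H) if r % K == u] for u in range(K)}

# Step S1203: select rows base .. base+K-1; each unit selects its row in range.
base = 0
selected = {u: next(r for r in units[u] if base <= r < base + K) for u in range(K)}

# Steps S1205-S1209: output the initial K columns, then S new columns per slide.
window = [list(fmap[selected[(base + r) % K], 0:K]) for r in range(K)]
for col in range(K, W):
    for r in range(K):
        window[r] = window[r][S:] + [fmap[selected[(base + r) % K], col]]
    assert np.array_equal(np.array(window),
                          fmap[base:base + K, col - K + 1:col + 1])
print("window-corresponding matrices reproduced from last-S-column reads")
```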
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may be used, and/or particular elements may be implemented in hardware, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and devices may be implemented by programming hardware (for example, programmable logic circuits including field programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language (such as VERILOG, VHDL, or C++) using the circuit principles and methods according to the present disclosure.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the above methods, systems, and devices are merely exemplary embodiments or examples, and the scope of the invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by equivalent elements thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (28)

  1. A data caching circuit, the circuit being configured to cache data of a feature map to be computed by a neural network, wherein a convolution kernel of the neural network has a size of K*K data, a window corresponding to the convolution kernel slides in the feature map with a stride S, and K and S are positive integers, the circuit comprising:
    a buffer comprising K cache units, wherein each cache unit is configured to store multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
  2. The circuit of claim 1, wherein, for each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.
  3. The circuit of claim 1, further comprising:
    K register groups, wherein each register group is configured to receive data from a corresponding cache unit and to output the stored data to a calculation circuit.
  4. The circuit of claim 1, further comprising a cache controller configured to:
    select K consecutive rows of the feature map;
    control the buffer to output the data of a matrix corresponding to the window;
    slide the window along the row direction within the K rows; and
    after the window slides, control the buffer to output the last S columns of the matrix corresponding to the window.
  5. The circuit of claim 4, wherein each of the K rows corresponds to a corresponding row in a corresponding one of the K cache units, and the cache controller is further configured to:
    for each cache unit, control the cache unit to output the data of the corresponding row column by column according to the column address.
  6. The circuit of claim 5, wherein the cache controller is further configured to:
    select K consecutive rows starting from the first row address of the feature map;
    after the window has slid to the end of the K rows, reselect the (S+1)-th through K-th of the K rows and the S rows following the K rows;
    control the buffer to output the reselected K rows starting from the first column address.
  7. The circuit of claim 5, wherein the cache controller is further configured to:
    after the window has slid to the end of the K rows, for each cache unit that output the 1st through S-th of the K rows, select the next row stored therein, and for each cache unit that output the (S+1)-th through K-th of the K rows, keep the currently selected row;
    control each cache unit to output the selected row starting from the first column address of the selected row.
  8. The circuit of claim 3, further comprising:
    a multiplexer configured to transfer the data from each cache unit to the corresponding register group.
  9. The circuit of claim 3, wherein each register group comprises:
    a write register configured to receive data from the corresponding cache unit; and
    a calculation register configured to receive data from the write register and to output the registered data to the calculation circuit.
  10. The circuit of claim 9, wherein each register group is configured such that:
    when the register group is in a data read mode, the write register receives data from the corresponding cache unit, and the calculation register outputs the registered data to the calculation circuit; and
    when the register group is in a data shift mode, the write register shifts the registered data into the calculation register.
  11. The circuit of claim 9, wherein the calculation register comprises:
    K register units, the last of the K register units being configured to receive data from the write register,
    wherein, in response to receiving data from the write register, each of the last (K-1) of the K register units shifts the data registered therein into the previous register unit.
  12. The circuit of claim 9, wherein, whenever the row output by any cache unit changes, the calculation register in each register group is cleared.
  13. The circuit of claim 1, wherein each datum of the feature map comprises the data having the same two-dimensional address on all channels, the two-dimensional address of each datum comprising the row address and the column address of the datum.
  14. The circuit of claim 1, wherein the buffer comprises a random access memory (RAM).
  15. The circuit of claim 3, wherein the calculation circuit is a vector-matrix multiplication circuit or an integrated computing-in-memory circuit.
  16. A data caching method, the method storing data of a feature map to be computed by a neural network in a buffer, wherein a convolution kernel of the neural network has a size of K*K data, a window corresponding to the convolution kernel slides in the feature map with a stride S, the buffer comprises K cache units, and K and S are positive integers, the method comprising:
    storing, in each cache unit, multiple rows of the feature map, the multiple rows including a corresponding row in every K rows of the feature map.
  17. The method of claim 16, wherein,
    for each row of the feature map, the remainder obtained by dividing the row address by K corresponds to the index of the cache unit storing that row of the feature map.
  18. The method of claim 16, further comprising:
    for each of K register groups, receiving data from the corresponding cache unit, and outputting the stored data to a calculation circuit.
  19. The method of claim 16, further comprising:
    selecting, by a cache controller, K consecutive rows of the feature map;
    controlling, by the cache controller, the buffer to output the data of a matrix corresponding to the window;
    sliding, by the cache controller, the window along the row direction within the K rows; and
    after each slide of the window, controlling, by the cache controller, the buffer to output the last S columns of the matrix corresponding to the window.
  20. The method of claim 19, wherein each of the K rows corresponds to a corresponding row in a corresponding one of the K cache units, the method further comprising:
    for each cache unit, controlling, by the cache controller, the cache unit to output the data of the corresponding row column by column according to the column address.
  21. The method of claim 20, further comprising:
    selecting, by the cache controller, K consecutive rows starting from the first row address of the feature map;
    after the window has slid to the end of the K rows, reselecting, by the cache controller, the (S+1)-th through K-th of the K rows and the S rows following the K rows;
    controlling, by the cache controller, the buffer to output the reselected K rows starting from the first column address.
  22. The method of claim 20, further comprising:
    after the window has slid to the end of the K rows, selecting, by the cache controller, for each cache unit that output the 1st through S-th of the K rows, the next row stored therein, and keeping, for each cache unit that output the (S+1)-th through K-th of the K rows, the currently selected row;
    controlling, by the cache controller, each cache unit to output the selected row starting from the first column address of the selected row.
  23. The method of claim 18, further comprising:
    transferring, by a multiplexer, the data from each cache unit to the corresponding register group.
  24. The method of claim 18, wherein each register group comprises a write register and a calculation register, the method further comprising:
    receiving, by the write register, data from the corresponding cache unit; and
    receiving, by the calculation register, data from the write register, and outputting the registered data to the calculation circuit.
  25. The method of claim 24, further comprising:
    when the register group is in a data read mode, receiving, by the write register, data from the corresponding cache unit, and outputting, by the calculation register, the registered data to the calculation circuit; and
    when the register group is in a data shift mode, shifting, by the write register, the registered data into the calculation register.
  26. The method of claim 24, wherein the calculation register comprises K register units, the method further comprising:
    receiving, by the last of the K register units, data from the write register,
    wherein, in response to receiving data from the write register, each of the last (K-1) of the K register units shifts the data registered therein into the previous register unit.
  27. The method of claim 24, further comprising:
    clearing the calculation register in each register group whenever the row output by any cache unit changes.
  28. The method of claim 16, wherein each datum of the feature map comprises the data having the same two-dimensional address on all channels, the two-dimensional address of each datum comprising the row address and the column address of the datum.
PCT/CN2020/080318 2020-02-26 2020-03-20 Data caching circuit and method WO2021168944A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/849,913 US11216375B2 (en) 2020-02-26 2020-04-15 Data caching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010118620.0A CN113313228B (zh) 2020-02-26 2020-02-26 Data caching circuit and method
CN202010118620.0 2020-02-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/849,913 Continuation US11216375B2 (en) 2020-02-26 2020-04-15 Data caching

Publications (1)

Publication Number Publication Date
WO2021168944A1 (zh)

Family

ID=77370142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/080318 WO2021168944A1 (zh) 2020-02-26 2020-03-20 数据缓存电路和方法

Country Status (2)

Country Link
CN (1) CN113313228B (zh)
WO (1) WO2021168944A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629406A (zh) * 2017-03-24 2018-10-09 展讯通信(上海)有限公司 Computing device for a convolutional neural network
CN108805266A (zh) * 2018-05-21 2018-11-13 南京大学 Reconfigurable CNN high-concurrency convolution accelerator
CN109214506A (zh) * 2018-09-13 2019-01-15 深思考人工智能机器人科技(北京)有限公司 Apparatus and method for constructing a convolutional neural network
US20190205735A1 (en) * 2017-12-29 2019-07-04 Facebook, Inc. Lowering hardware for neural networks
CN110390384A (zh) * 2019-06-25 2019-10-29 东南大学 Configurable general-purpose convolutional neural network accelerator
CN110705687A (zh) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolutional neural network hardware computing device and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6771018B2 (ja) * 2015-07-23 2020-10-21 マイヤプリカ テクノロジー エルエルシー Performance enhancement of a two-dimensional array processor
US10417560B2 (en) * 2016-12-01 2019-09-17 Via Alliance Semiconductor Co., Ltd. Neural network unit that performs efficient 3-dimensional convolutions
EP3557484B1 (en) * 2016-12-14 2021-11-17 Shanghai Cambricon Information Technology Co., Ltd Neural network convolution operation device and method
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
CN108182471B (zh) * 2018-01-24 2022-02-15 上海岳芯电子科技有限公司 Convolutional neural network inference accelerator and method
CN108681984B (zh) * 2018-07-26 2023-08-15 珠海一微半导体股份有限公司 Acceleration circuit for a 3*3 convolution algorithm
CN109934339B (zh) * 2019-03-06 2023-05-16 东南大学 General-purpose convolutional neural network accelerator based on a one-dimensional systolic array


Also Published As

Publication number Publication date
CN113313228B (zh) 2022-10-14
CN113313228A (zh) 2021-08-27

Similar Documents

Publication Publication Date Title
Meng et al. Ar-net: Adaptive frame resolution for efficient action recognition
CN110705687B (zh) Convolutional neural network hardware computing device and method
Deng et al. DrAcc: A DRAM based accelerator for accurate CNN inference
CN108304922B (zh) Computing device and computing method for neural network computation
US11017264B2 (en) Method and apparatus with dilated convolution
US10936937B2 (en) Convolution operation device and convolution operation method
EP0422348A2 (en) Two-dimensional systolic array for neural networks, and method
US20230289230A1 (en) Method and apparatus for accelerating convolutional neural network
US11709911B2 (en) Energy-efficient memory systems and methods
US10734448B2 (en) Convolutional neural network system employing resistance change memory cell array
CN110766127B (zh) Dedicated circuit for neural network computation and related computing platform and implementation method
CN108717571B (zh) Acceleration method and device for artificial intelligence
US20190385005A1 (en) Framebuffer-less system and method of convolutional neural network
CN114761925A (zh) Efficient utilization of processing element arrays
TWI764081B (zh) Framework for combining multiple global descriptors for image retrieval
US11941872B2 (en) Progressive localization method for text-to-video clip localization
CN112926731A (zh) Device and method for performing matrix multiplication operations of a neural network
Shi et al. Anchor-based self-ensembling for semi-supervised deep pairwise hashing
WO2021168944A1 (zh) Data caching circuit and method
CN108764182B (zh) Optimized acceleration method and device for artificial intelligence
WO2022007265A1 (zh) Dilated convolution accelerated computing method and device
WO2021188262A1 (en) Processing in memory methods for convolutional operations
US20210390379A1 (en) Data Loading
US20190164035A1 (en) Device for reorganizable neural network computing
JP6938698B2 (ja) Framework for combining multiple global descriptors for image retrieval

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921169

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20921169

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 19/04/2023)
