WO2022027172A1 - Data processing apparatus, method, and system, and neural network accelerator - Google Patents


Info

Publication number
WO2022027172A1
Authority
WO
WIPO (PCT)
Prior art keywords
input feature
data
cache
feature data
loading
Prior art date
Application number
PCT/CN2020/106556
Other languages
French (fr)
Chinese (zh)
Inventor
李鹏
韩峰
杨康
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/106556 priority Critical patent/WO2022027172A1/en
Publication of WO2022027172A1 publication Critical patent/WO2022027172A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Definitions

  • The present disclosure relates to the technical field of artificial intelligence, and in particular, to a data processing apparatus, method, and system, and a neural network accelerator.
  • Neural networks often involve a large amount of convolution processing.
  • For convolution processing, it is necessary to load the input feature data and the convolution kernel parameters into the systolic processing array of the neural network, and then use the systolic processing array to perform calculations on the input feature data and the convolution kernel parameters to obtain the output feature data.
  • When insufficient input feature data is supplied, processing units in the systolic processing array are idle, so that the processing efficiency of the systolic processing array is low.
  • The embodiments of the present disclosure propose a data processing apparatus, method, and system, and a neural network accelerator, to solve the technical problem of low processing efficiency of a systolic processing array in the related art.
  • A data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first loading units, each of which is used to: access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
  • A data processing system is provided, comprising the data processing apparatus described in any one of the embodiments, and a systolic processing array for loading the input feature data and convolution kernel parameters and for performing convolution processing on the input feature data and the convolution kernel parameters.
  • a neural network accelerator where the neural network accelerator includes the data processing apparatus described in any embodiment, or includes the data processing system described in any embodiment.
  • A data processing method is provided, applied to each first loading unit in a data processing apparatus that includes a plurality of first loading units, to load input feature data into a systolic processing array. The method includes: accessing a storage unit in parallel by each first loading unit in the plurality of first loading units; reading the input feature data from the storage unit; buffering the read input feature data; and loading the buffered input feature data into at least one row of processing units in the systolic processing array.
  • A computer-readable storage medium is provided on which a computer program is stored, which, when executed by a processor, implements the method described in any of the embodiments of the present disclosure.
  • A data processing apparatus is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any of the first loading units in the method of any of the embodiments of the present disclosure.
  • Since a plurality of first loading units acquire the input feature data from the storage unit in parallel, the amount of valid data acquired by the systolic processing array is multiplied compared with acquiring data through only one loading unit. The idle time of the processing units in the systolic processing array is therefore reduced, and the processing efficiency of the systolic processing array is improved.
  • FIG. 1 is a schematic diagram of a processing manner of a systolic processing array according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of a data loading manner according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of a data flow process in a systolic processing array according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a data read and load process according to an embodiment of the present disclosure.
  • FIGS. 5A and 5B are schematic diagrams of valid data read from a memory cell according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 7A and 7B are schematic diagrams illustrating changes in the amount of cached data during a data loading process according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a first loading unit according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of instantiating multiple cache units according to an embodiment of the present disclosure.
  • FIG. 10 is a processing flowchart of a data processing apparatus according to an embodiment of the present disclosure.
  • FIG. 11A is a schematic diagram of the length of a conventional systolic processing array.
  • FIG. 11B is a schematic diagram of the length of a systolic processing array in accordance with an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of a computer device according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of a memory cell according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of a data processing system according to an embodiment of the present disclosure.
  • FIGS. 15A and 15B are respectively schematic diagrams of neural network accelerators according to embodiments of the present disclosure.
  • FIG. 16 is a flowchart of a data processing method according to an embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, without departing from the scope of the present disclosure, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information.
  • The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • A data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first loading units; each first loading unit of the plurality of first loading units can access a storage unit in parallel to read input feature data from the storage unit, buffer the read input feature data, and load the buffered input feature data into at least one row of processing units in the systolic processing array.
  • The input feature data here can be an input feature map (Input Feature Map, IFM), which can come from an image, voice, or text; correspondingly, the output feature data is an output feature map (Output Feature Map, OFM), which can be converted into an image, voice, or text.
  • The systolic processing array is a simple and efficient processing device.
  • The input feature data is multiplexed in the systolic processing array, which can reduce the input bandwidth requirement for the input feature data.
  • A common systolic processing array is rectangular, including R × H processing units: each row has H processing units and each column has R processing units, where R and H may or may not be equal.
  • The processing unit in the i-th row and j-th column is used to pass the input feature data obtained in the current clock cycle to the processing unit in the i-th row and (j+1)-th column in the next clock cycle, and to send on its operation result.
  • The processing unit does not necessarily process every input feature data it acquires.
  • For example, after the processing unit in row 1, column 1 transmits its input feature data to the processing unit in row 1, column 2, the processing unit in row 1, column 2 may not multiply that input feature data with the convolution kernel parameter it holds, but instead pass it directly to the processing unit in row 1, column 3.
  • Each processing unit can be used to perform multiplication and addition operations on the input feature data loaded into the processing unit and the convolution kernel parameters, and output the operation result to the next processing unit, and the operation result of the last row of processing units is used as the output feature data.
  • One input feature data and one convolution kernel parameter can be loaded into one processing unit, or one processing unit can be loaded with an input feature data block of size M × N and a convolution kernel parameter block of size K × L, where R, H, M, N, K, and L are all positive integers.
  • The first loading unit is used to load the input feature data into the corresponding processing units. The input feature data can be stored in a storage unit, which may be a static random-access memory (Static Random-Access Memory, SRAM) or another type of storage unit.
  • Because the processing result of a processing unit is passed down and accumulated with the product computed by the processing unit in the next row, the input feature data of each row of processing units is delayed by one clock cycle row by row. When the product of a row's input feature data and its convolution kernel parameters has been calculated, it is accumulated exactly with the product sent by the previous row of processing units. If the input of the input feature data is stopped, the computation of at least one processing unit in the systolic processing array at the corresponding moment also stops.
  • The second loading unit is used to load the convolution kernel parameters into the corresponding processing units. The convolution kernel parameters may be stored in a storage unit, and the storage unit for storing the convolution kernel parameters and the storage unit for storing the input feature data may be the same storage unit or different storage units.
  • the convolution kernel parameters in the processing unit will remain unchanged, and the convolution kernel parameters will be reused for different input feature data flowing into the processing unit.
  • The loading order of the convolution kernel parameters and the input feature data is not limited here.
  • The convolution kernel parameters can be loaded earlier than, later than, or at the same time as the input feature data.
  • Different processing units can load convolution kernel parameters at the same time, or sequentially in a certain order; for example, they can be loaded into the systolic processing array in the same way as the input feature data.
  • The output unit is used for buffering the output data of the processing units, or for sending the processing results of the processing units (i.e., the output feature data) to other processing units for further processing or storage.
  • The first loading unit can load one input feature map into the systolic processing array at a time, or multiple input feature maps can be loaded into the systolic processing array at a time and processed in the array.
  • the second loading unit can load one set of convolution kernel parameters into the systolic processing array at a time, or can load multiple sets of convolution kernel parameters into the systolic processing array.
  • For example, the second loading unit can load four groups of convolution kernel parameters into the systolic processing array; the different colored squares in the figure represent different convolution kernel parameters.
  • For example, the first loading unit can input (row_x, row_y, row_z) of the first input feature map to the first three rows of processing units of the systolic processing array, and input (row_x, row_y, row_z) of the second input feature map to the next three rows of processing units of the systolic processing array.
  • The first three rows of processing units are located before the next three rows of processing units in the processing sequence. In this way, two input feature maps can be processed simultaneously.
  • The maximum demand rate of the systolic processing array for input feature data is H per clock cycle; that is, in each clock cycle, one input feature data can be input to each row of the systolic processing array.
  • For example, suppose the input feature data of the i-th row and the j-th column is called aij, and H is equal to 5. As shown in FIG. 3, in the first clock cycle, a11 is input to M11.
  • In the second clock cycle, a11 is transferred from M11 rightward to M12, new data a12 is input into M11, and a21 is input into M21 at the same time.
  • In the third clock cycle, a11 is passed rightward to M13, a12 is passed rightward to M12, new data a13 is input into M11, a21 is passed rightward to M22, new data a22 is input into M21, and new data a31 is input into M31.
  • the data in the gray box in the figure is the new data flowing into the systolic processing array every clock cycle.
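As an illustrative sketch (not part of the disclosure), the skewed input timing above can be modeled with two small helpers; the names `entry_cycle` and `arrival_cycle` and the 1-based indexing are assumptions made for illustration:

```python
# Sketch of the skewed ("wavefront") input timing described above.
# Assumption: each row starts one clock cycle after the previous row,
# and data moves one column to the right per cycle, as in FIG. 3.

def entry_cycle(i, k):
    """Clock cycle at which the k-th datum of row i (a_ik) enters M_i1."""
    # Row i starts one cycle after row i-1; within a row, one datum per cycle.
    return i + (k - 1)

def arrival_cycle(i, k, j):
    """Clock cycle at which a_ik reaches the processing unit M_ij."""
    # Moving right takes one cycle per column.
    return entry_cycle(i, k) + (j - 1)

# Reproduce the example from the text (H = 5):
assert entry_cycle(1, 1) == 1       # cycle 1: a11 enters M11
assert arrival_cycle(1, 1, 2) == 2  # cycle 2: a11 is in M12
assert entry_cycle(2, 1) == 2       # cycle 2: a21 enters M21
assert arrival_cycle(1, 1, 3) == 3  # cycle 3: a11 is in M13
assert entry_cycle(3, 1) == 3       # cycle 3: a31 enters M31
```

The model makes it easy to see why each row's input is delayed by exactly one cycle relative to the row above it.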
  • input feature data may be stored in a storage unit, read from a data output port of the storage unit by a first loading unit, and loaded into a systolic processing array.
  • A portion of the input feature data of the input feature map read every clock cycle from a data output port is loaded into a row of the systolic processing array.
  • The minimum access unit of on-chip SRAM (SRAM On Chip) is generally 2^n bytes, where n is an integer; that is to say, at least 2^n bytes of data are accessed each time.
  • The bit width of the data output port of the SRAM is equal to its minimum access unit; that is, the input feature data is accessed in the SRAM according to the minimum access unit of the SRAM.
  • The technical solutions of the embodiments of the present disclosure are described below taking the length of each data as 1 byte (i.e., 8 bits) as an example. In practical applications, the length of a data may also be greater than 1 byte; the situation when the length is another value is similar to the situation when the data length is 1 byte, and details are not repeated in this disclosure. Now consider the following cases:
  • Case 1: The amount of data corresponding to the number of columns of the input feature map stored in the SRAM is greater than the amount of data corresponding to the minimum access unit of the SRAM, and is not an integer multiple of the minimum access unit. The amount of data the data processing apparatus accesses from the SRAM in one clock cycle is then less than the data amount of one row of the input feature map stored in the SRAM, and after access operations are performed on one row of the input feature map, the efficiency of the access operation corresponding to the data at the tail of that row is reduced, as shown in FIG. 5A.
  • Case 2 The number of columns of the input feature map stored in the SRAM is relatively small, and one minimum access unit can correspond to multiple rows of input feature data. When only one row of input feature data needs to be accessed from the SRAM, although multiple rows of data can be accessed in one access operation, only one row of data is valid data. As shown in FIG. 5B , the white rectangles in the figure represent the input feature data on the 1st, 3rd, and 5th rows, and the gray rectangles represent the input feature data on the 2nd, 4th, and 6th rows.
  • For example, suppose the number of columns of a row of input feature data is 32 bytes (each input feature data in the row being 1 byte) and the minimum access unit is 64 bytes. Then 64 bytes of data can be taken out when accessing the SRAM in each clock cycle (for example, the 1st row of data together with the 2nd row of data). However, since the systolic processing array is loaded with one row of the input feature data at a time, only the 1st row of data or the 2nd row of data is valid data; the valid data is therefore only 32 bytes.
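The loss of access efficiency in these two cases can be quantified with a small helper. This is an illustrative sketch, not part of the disclosure; the function names and the 96-byte row in the second example are assumptions:

```python
import math

def access_efficiency(valid_bytes, min_access_unit):
    """Fraction of one SRAM access that is valid data; one access
    always fetches exactly min_access_unit bytes."""
    return valid_bytes / min_access_unit

# Case 2 example from the text: a 32-byte row fetched with a 64-byte
# minimum access unit -- only one of the two fetched rows is needed.
assert access_efficiency(32, 64) == 0.5

def row_access_efficiency(row_bytes, min_access_unit):
    """Efficiency of reading one full row when the row is not an
    integer multiple of the minimum access unit (Case 1 flavour)."""
    accesses = math.ceil(row_bytes / min_access_unit)
    return row_bytes / (accesses * min_access_unit)

# Assumed Case-1 example: a 96-byte row with a 64-byte access unit
# needs two accesses (128 bytes fetched) for 96 valid bytes.
assert row_access_efficiency(96, 64) == 0.75
```

When the row length is an exact multiple of the minimum access unit, `row_access_efficiency` returns 1.0, which is why the inefficiency only appears in the two cases described above.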
  • If the data volume corresponding to the height H of the systolic processing array is exactly equal to the data volume corresponding to the minimum access unit, the valid data provided in odd-numbered clock cycles is less than the maximum data consumption rate of the systolic processing array, so the systolic processing array cannot obtain enough valid data in even-numbered clock cycles.
  • In other words, the data provided per clock cycle cannot meet the maximum data consumption rate of the systolic processing array; that is, the systolic processing array cannot obtain enough valid data per clock cycle.
  • When the height H of the systolic processing array is large (greater than the minimum access unit), the above phenomenon is more obvious.
  • an embodiment of the present disclosure provides a data processing apparatus for loading input feature data into a systolic processing array.
  • the apparatus may include:
  • a plurality of first loading units 602, each first loading unit of the plurality of first loading units 602 being used to: access a storage unit in parallel to read input feature data from the storage unit, buffer the read input feature data, and load the buffered input feature data into at least one row of processing units in a systolic processing array.
  • The size of the systolic processing array may be determined according to the size of the convolution kernel parameters loaded in the systolic processing array; that is, a suitable size is selected according to the size of the convolution kernel parameters to be loaded.
  • In order to improve the utilization of processing units in the systolic processing array, the size of the systolic processing array may be an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array. For example, if the convolution kernel parameters loaded in the systolic processing array are all of size K × L, the number of rows of the systolic processing array is an integer multiple of K and the number of columns is an integer multiple of L.
  • the systolic processing array can be divided into several array blocks according to the dimension of "row", each array block includes one or more rows of processing units, and the heights of each array block can be equal.
  • the size of one array block is equal to the size of one convolution kernel parameter loaded in the systolic processing array. That is to say, one array block corresponds to one convolution kernel parameter.
  • Alternatively, the size of one array block may be equal to the sum of the sizes of multiple convolution kernel parameters loaded in the systolic processing array.
  • For example, as shown in the figure, every 3 rows of processing units in the systolic processing array are regarded as one array block, so that the systolic processing array is divided into two array blocks, array block 1 and array block 2.
  • each small square represents a processing unit.
  • Each first loading unit can load input feature data into one array block in the systolic processing array, ie, each array block uses one first loading unit independently.
  • the size of the input feature data loaded into one array block by the first loading unit in each clock cycle is equal to the size of the data corresponding to the height of one array block.
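As a sketch of the row-wise division into array blocks described above (illustrative only; the function name and the 0-based row indices are assumptions, and the 6-row example follows the 3-rows-per-block figure):

```python
def divide_into_array_blocks(num_rows, block_height):
    """Split the rows of a systolic processing array into equal-height
    array blocks; each block is served by its own first loading unit."""
    # The text assumes equal block heights, so require an exact fit.
    assert num_rows % block_height == 0, "rows must be an integer multiple"
    return [list(range(start, start + block_height))
            for start in range(0, num_rows, block_height)]

# Example matching the figure: 6 rows, 3-row kernels -> 2 array blocks.
blocks = divide_into_array_blocks(6, 3)
assert blocks == [[0, 1, 2], [3, 4, 5]]
assert len(blocks) == 2  # i.e., two first loading units are used
```

Each entry of `blocks` lists the rows one first loading unit feeds, which is exactly the "one loading unit per array block" arrangement in the text.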
  • Each first loading unit in the plurality of first loading units 602 may be connected to a first interface, and the first interface may be various types of interfaces, for example, an Application Programming Interface (Application Programming Interface, API), Remote Procedure Calls (RPC), or Remote Method Invocation (RMI).
  • Each first loading unit in the plurality of first loading units 602 can access a storage unit for storing input feature data in parallel through a first interface provided by itself.
  • the storage unit may be an SRAM or other type of storage device.
  • The storage unit may include a plurality of data output interfaces, and the first interface of each first loading unit communicates with one data output interface of the storage unit to obtain input feature data from the corresponding data output interface of the storage unit.
  • the data output interface may be the same type of interface as the first interface.
  • Alternatively, the storage unit may include only one data output interface, and the first interface of each first loading unit communicates with that same data output interface of the storage unit to obtain input feature data from it.
  • The time at which the plurality of first loading units 602 load the input feature data into the processing units in the u-th row of the systolic processing array is one clock cycle earlier than the time at which they load the input feature data into the processing units in the (u+1)-th row, where u is a positive integer.
  • That is, the time at which one of the plurality of first loading units 602 loads the input feature data into the processing units in the u-th row of the systolic processing array is one clock cycle earlier than the time at which another first loading unit loads the input feature data into the processing units in the (u+1)-th row, where u is a positive integer.
  • The plurality of first loading units 602 can communicate with each other so that the output timing of the input feature data is maintained on a slope; that is, the input feature data received by the current row of the systolic processing array is delayed by one clock cycle relative to the input feature data received by the previous row. When the product of the input feature data received by the current row and the convolution kernel parameter has been calculated, it is accumulated exactly with the product sent by the previous row of the systolic processing array.
  • each of the plurality of first loading units 602 may be provided with a second interface through which adjacent first loading units communicate.
  • The previous first loading unit (for example, the k-th first loading unit) may, upon completing its loading of data into the systolic processing array, send a synchronization signal through the second interface to the next first loading unit (the (k+1)-th first loading unit), and the next first loading unit starts to load data into the systolic processing array after receiving the synchronization signal.
  • the type of the second interface may be the same as or different from that of the first interface.
  • The sum of the buffering rates of the input feature data of the first loading units is greater than or equal to the loading rate of the systolic processing array.
  • Specifically, the sum of the cache rates of the input feature data in each clock cycle may be greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average cache rates of the input feature data may be greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data may be greater than or equal to the maximum loading rate of the systolic processing array.
  • The average buffering rate is the ratio of the total number of input feature data buffered in all clock cycles up to and including the current clock cycle to the number of those clock cycles.
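The rate condition above can be checked with a short model. This is an illustrative sketch, not part of the disclosure; the helper names and the example per-cycle byte counts are assumptions:

```python
def average_rate(bytes_per_cycle):
    """Average buffering rate up to and including the current cycle:
    total bytes buffered so far divided by the number of cycles."""
    return sum(bytes_per_cycle) / len(bytes_per_cycle)

def can_sustain(per_unit_rates, load_rate):
    """True if the summed average cache rate of all first loading
    units is at least the systolic array's loading rate."""
    return sum(average_rate(rates) for rates in per_unit_rates) >= load_rate

# Assumed example: two loading units alternating 32- and 8-byte useful
# fetches per cycle, with the array consuming 32 bytes per cycle.
unit = [32, 8, 32, 8]           # average rate: 20 bytes/cycle
assert can_sustain([unit, unit], 32)   # 2 * 20 = 40 >= 32: sustained
assert not can_sustain([unit], 32)     # 20 < 32: one unit starves the array
```

The second assertion shows why a single loading unit is insufficient in the inefficient-access cases, while two units in parallel satisfy the condition.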
  • The input feature data is not loaded into the systolic processing array until it has been buffered in a first loading unit.
  • the first loading unit can be used as a data pool to summarize the input feature data, and then stably output the data to the systolic processing array.
  • Each data output interface of the storage unit can output part or all of the input feature data of one row to the first interface of a first loading unit within one clock cycle. For example, data output interface 1 of the storage unit outputs the input feature data of the 1st row to first interface 1 of one first loading unit in the first clock cycle, data output interface 2 of the storage unit outputs the input feature data of the 2nd row to first interface 2 of another first loading unit in the first clock cycle, and so on.
  • Each of the first loading units may include at least one buffer sub-unit for buffering the input feature data acquired from the storage unit.
  • When a first loading unit includes multiple cache subunits, each cache subunit can sequentially acquire input feature data from the storage unit in a certain order and cache it, and the cached input feature data is sequentially loaded into at least one row of processing units of the systolic processing array.
  • Each cache subunit of the plurality of cache subunits can start to cache the input feature data corresponding to it once the input feature data corresponding to the previous cache subunit has been cached.
  • the loading of the previous cache subunit in the loading sequence is performed in parallel with the cache process of the latter cache subunit.
  • For example, when the number of cache subunits is 2, data can first be cached through cache subunit 1; when cache subunit 1 has finished caching, the data in cache subunit 1 is loaded into the systolic processing array. While the data in cache subunit 1 is being loaded, data can be cached by cache subunit 2.
  • When the caching of cache subunit 2 is completed, the data in cache subunit 2 is loaded into the systolic processing array; while the data in cache subunit 2 is being loaded, data can again be cached by cache subunit 1, and so forth.
  • the number of cache subunits is greater than 2, the caching and loading methods are similar to the above-mentioned cases, and are not repeated here.
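The alternation between two cache subunits (a ping-pong, or double-buffering, scheme) can be sketched as follows; the function name and the (fill, drain) tuple convention are illustrative assumptions, not part of the disclosure:

```python
def double_buffer_schedule(num_chunks):
    """For data chunk n, cache subunit n % 2 is being filled while the
    other subunit (holding chunk n-1, if any) is drained into the
    systolic processing array. None means nothing to drain yet."""
    return [(n % 2, (n + 1) % 2 if n > 0 else None)
            for n in range(num_chunks)]

# Four data chunks: subunit 0 fills first (nothing drains), then the
# two subunits swap fill/drain roles every chunk.
assert double_buffer_schedule(4) == [(0, None), (1, 0), (0, 1), (1, 0)]
```

With more than two subunits the same rotation applies modulo the subunit count, which is why the text treats the general case as analogous.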
  • The cache subunit may include a plurality of cache blocks, each of which is used to cache the input feature data required by one row of processing units in the systolic processing array. A loading subunit loads the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array. After the input feature data cached in the v-th cache block has been loaded, a first load instruction can be sent for the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, where v is an integer greater than 1.
  • the first loading unit may load the input feature data buffered in the buffer subunit into the systolic processing array.
  • For example, each first loading unit reads one input feature data from the corresponding cache subunit within one clock cycle and loads it into the systolic processing array. Since multiple first loading units acquire the input feature data from the storage unit in parallel, the amount of data acquired per clock cycle is multiplied compared with acquiring data through only one loading unit.
  • The quantity of valid data acquired by the systolic processing array increases correspondingly, thereby reducing the idle time of the processing units in the systolic processing array and improving the processing efficiency of the systolic processing array.
  • the following two situations are respectively analyzed to illustrate the technical effects of the embodiments of the present disclosure:
  • Initially, no valid data is cached in either first loading unit. In the first clock cycle, each first loading unit can newly cache 32 bytes of valid data; therefore, in step S702, the two first loading units cache a total of 64 bytes of valid data in the first clock cycle. In the second clock cycle, each first loading unit can newly cache 8 bytes of valid data, so the two first loading units cache an additional 16 bytes of valid data in the second clock cycle.
  • Together with the data remaining in the cache, the two first loading units hold a total of 48 bytes of valid data in the second clock cycle.
  • In step S702 corresponding to the first clock cycle, 32 bytes of input feature data are taken out of the cache and loaded into the systolic processing array, and in step S704 the two first loading units cache the remaining 32 bytes of input feature data. In the second clock cycle, the two first loading units take out 32 bytes of input feature data from the cache and load them into the systolic processing array, leaving 16 bytes of input feature data in the cache. Therefore, according to the solutions of the embodiments of the present disclosure, the systolic processing array can continuously obtain enough input feature data.
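The Case-1 bookkeeping above can be reproduced with a small occupancy model. The loop and names are illustrative assumptions; the byte counts follow the example in the text (two units alternately buffering 32 and 8 valid bytes each, with the array consuming 32 bytes per cycle):

```python
def buffer_occupancy(per_cycle_fill, consume, cycles):
    """Bytes remaining in the shared cache after each clock cycle:
    add the new valid bytes, subtract what the array loads."""
    occupancy, trace = 0, []
    for c in range(cycles):
        occupancy += per_cycle_fill[c % len(per_cycle_fill)]  # new valid data
        occupancy -= consume                                  # array loads
        trace.append(occupancy)
    return trace

# Two units together: 2*32 = 64 B in odd cycles, 2*8 = 16 B in even cycles.
trace = buffer_occupancy([64, 16], consume=32, cycles=4)
assert trace == [32, 16, 48, 32]   # matches the 32/16/48 figures above
assert min(trace) >= 0             # the cache never underflows: array stays fed
```

With only one loading unit the fill pattern would be [32, 8], and the occupancy would go negative in the second cycle, i.e., the array would stall.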
  • Case 2: the number of columns of the input feature data is relatively small. As shown in FIG. 7B, assume that the number of columns of the input feature data is 32 bytes, each input feature data is 1 byte, the minimum access unit is 64 bytes, and the number of first loading units is 2; then the amount of valid data cached by one first loading unit in each clock cycle is 64 bytes. Initially, no valid data is cached in either loading unit. In the first clock cycle, each first loading unit can newly cache 32 bytes of valid data; therefore, in step S712, the two first loading units cache a total of 64 bytes of valid data in the first clock cycle. In step S716, the two loading units cache a total of 64 bytes of valid data in the second clock cycle.
  • In step S712 corresponding to the first clock cycle, 64 bytes of input feature data are taken out of the cache and loaded into the systolic processing array.
  • In step S714, the two first loading units have no buffered data; in step S716 corresponding to the second clock cycle, the two first loading units take out 64 bytes of input feature data from the cache and load them into the systolic processing array.
  • In step S718, the two first loading units again have no buffered data. It can be seen that the buffering rate of the two first loading units is equal to the data consumption rate of the systolic processing array, so the systolic processing array can always obtain enough input feature data.
  • the amount of input feature data buffered in each clock cycle is doubled, so that the cache can always provide enough data for loading, improving the processing efficiency of the systolic processing array.
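The supply/demand balance described above can be illustrated with a toy per-cycle model. This sketch is not part of the disclosure; the function name, its parameters, and the "buffer then consume" accounting are assumptions made purely for illustration.

```python
def simulate(valid_per_read, n_loaders, consume_per_cycle, cycles):
    """Toy model: each cycle, every first loading unit buffers one read's
    worth of valid data, then the systolic processing array consumes up to
    consume_per_cycle bytes from the combined cache."""
    buffered = 0
    history = []
    for _ in range(cycles):
        buffered += n_loaders * valid_per_read       # caching step
        consumed = min(buffered, consume_per_cycle)  # loading step
        buffered -= consumed
        history.append((consumed, buffered))
    return history

# FIG. 7B-like case: two units each buffer 32 valid bytes per cycle while
# the array consumes 64 bytes per cycle -- supply always meets demand.
print(simulate(32, 2, 64, 3))   # → [(64, 0), (64, 0), (64, 0)]

# FIG. 7A-like case: only 8 valid bytes per read per unit -- the array
# receives just 16 of the 64 bytes it could consume each cycle.
print(simulate(8, 2, 64, 3))    # → [(16, 0), (16, 0), (16, 0)]
```

The steady-state tuples show the consumed bytes and the leftover cache per cycle; when the first element stays below the array's consumption rate, the array is starved.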
  • when the height of the systolic processing array is large, data can be efficiently input to the systolic processing array in this way, so that the amount of data loaded in each clock cycle is as close as possible to the height of the systolic processing array. As a result,
  • the expansion of the height dimension is not limited, which is conducive to the flexible design of high-performance systolic processing arrays of different sizes.
  • the input feature map can be divided into blocks to obtain multiple data blocks.
  • the number of columns of each data block is less than the total number of columns of the input feature data, and the number of rows of each data block is less than or equal to the total number of rows of the input feature data.
  • only one data block is cached and loaded at a time, and the next data block can be cached after the cached data block is loaded.
  • the number of columns of the data block may be equal to the number of data corresponding to the minimum access unit of the storage unit.
  • the number of columns of the data block is 32 columns.
  • the number of columns of the data block may also be equal to an integer multiple of the number of data corresponding to the minimum access unit of the storage unit.
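The column-wise blocking just described can be sketched with a small helper. The function name and the half-open column ranges are assumptions for illustration only.

```python
def plan_blocks(total_cols, min_access_cols, multiple=1):
    """Split an input feature map into column blocks whose width is an
    integer multiple of the minimum access unit; the last block may be
    narrower when the total column count is not an exact multiple."""
    block_w = min_access_cols * multiple
    blocks = []
    start = 0
    while start < total_cols:
        blocks.append((start, min(start + block_w, total_cols)))
        start += block_w
    return blocks

# 200-column feature map, 64-data minimum access unit, one unit per block:
print(plan_blocks(200, 64))  # → [(0, 64), (64, 128), (128, 192), (192, 200)]
```

Each tuple is a (start, end) column range; only one such block would be cached and loaded at a time, as described above.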
  • each first loading unit in the plurality of first loading units may further perform first filling on the input feature data during the process of loading the input feature data. Since each first loading unit performs filling in the same way, the following takes the filling method of one first loading unit (referred to as loading unit A) as an example.
  • the loading unit A may first obtain the description information of the input feature data. The description information may include information indicating whether the input feature data needs to be filled; for example, "0" or a null value indicates that no filling is required, and "1" indicates that filling is required.
  • the description information may also include padding parameters, for example, the value of the padding data and the number of rows and/or columns of padding data.
  • the loading unit A can determine, according to the description information, whether the input feature data needs to be filled. If so, during data loading it can first determine whether the data currently to be loaded is input feature data or filling data: if it is input feature data, the corresponding data is read directly from the cache and loaded; if it is filling data, the filling data is generated according to the padding parameters and loaded.
  • the above-mentioned first filling may include filling at least one of the left boundary and the right boundary of the input feature data.
  • Filling the left boundary of the input feature data refers to adding at least one column of filling data before the first column of the input feature data; filling the right boundary of the input feature data refers to adding at least one column of filling data after the last column of the input feature data.
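A minimal sketch of the first filling (left/right boundaries only); the function name and the zero-fill default are assumptions for illustration.

```python
def first_fill(rows, pad_l, pad_r, value=0):
    """Add pad_l columns of filling data before the first column and
    pad_r columns after the last column of the input feature data."""
    return [[value] * pad_l + list(row) + [value] * pad_r for row in rows]

ifm = [[1, 2, 3],
       [4, 5, 6]]
print(first_fill(ifm, pad_l=1, pad_r=1))
# → [[0, 1, 2, 3, 0], [0, 4, 5, 6, 0]]
```

Note that in the apparatus the filling data is generated on the fly during loading rather than being stored in the cache.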
  • the first loading unit includes: a sending subunit, configured to send a read instruction to the storage unit; a cache subunit, configured to buffer the input
  • feature data returned by the storage unit according to the read instruction; and a loading subunit, configured to load the buffered input feature data into at least one row of processing units in the systolic processing array.
  • the sending subunit may include a first parsing subunit for receiving a loading instruction and parsing the loading instruction to generate description information of the input feature data to be loaded; and a second parsing subunit for parsing
  • the description information and sending the read instruction to the storage unit according to the parsing result.
  • the first parsing subunit may receive a loading instruction from the controller.
  • the present disclosure is not limited to this. According to other embodiments of the present disclosure, the first parsing subunit may also receive loading instructions from other apparatuses or storage devices.
  • the description information required by a first loading unit may specifically include some or all of the following:
  • the number of groups of input feature data to be loaded (ifm_num), which is used to indicate the total number of groups of input feature data that the first loading unit is responsible for loading.
  • an input feature map is a set of input feature data.
  • the number (ifm_start) of the first group of input feature data to be loaded, which is used to indicate the identification information of the first group of input feature data that the first loading unit is responsible for loading.
  • the numbers of the remaining groups of input feature data that the first loading unit is responsible for loading increase sequentially. For example, if ifm_start is equal to 1, the number of the second group of input feature data is 2, the number of the third group is 3, and so on.
  • the number of groups of input feature data (ifm_num_perb) that can be processed simultaneously by one array block in the systolic processing array (an array block generally includes multiple rows of processing units) and that the first loading unit is responsible for loading. During convolution processing, it is sometimes necessary to accumulate the convolution results of multiple groups of input feature data with their corresponding convolution kernel parameters, so the systolic processing array needs to be able to process multiple groups of input feature data at the same time.
  • the input feature map includes R (Red) channel map, G (Green) channel map, and B (Blue) channel map.
  • Three sets of convolution kernel parameters are used to convolve the three channel maps respectively, and the convolution
  • results of the three channel maps are then accumulated. Therefore, one array block in the systolic processing array can be used to load the three sets of convolution kernel parameters, and the first loading unit can load the above three channel maps for that array block. In this case, ifm_num_perb is equal to 3.
  • the height of the convolution kernel (kernel_h), indicating the height of the convolution kernel parameter in the array block that the first loading unit is responsible for loading.
  • the convolution kernels in the array block that the same first loading unit is responsible for loading may include all rows of a complete set of convolution kernel parameters, or only some rows of a complete convolution kernel.
  • the size of a set of convolution kernel parameters is 3×3,
  • the height of the array block that a first loading unit is responsible for loading may be 5, and these 5 rows may include the three rows of data in the first set of convolution kernel parameters and two rows of data in the second set of convolution kernel parameters.
  • the depth of the buffer subunits included in the first loading unit may be equal to the depth of the array block that the first loading unit is responsible for loading.
  • the starting position (kernel_h_start) of the array block that the first loading unit is responsible for loading, which is used to indicate the row number, in the systolic processing array, of the first row of processing units that the first loading unit is responsible for loading.
  • the base address (ifm_baddr) of the input feature data to be loaded, which is used to indicate the first address, in the storage unit, of the input feature data that the first loading unit is responsible for loading.
  • for example, the address of the second group of input feature data is the base address plus the bit length corresponding to ifm_len (the length of one group of input feature data).
  • the width of the input feature data to be loaded (ifm_w), indicating the total number of columns of a set of input feature data.
  • if this width is greater than the minimum access unit of the storage unit, the same row of input feature data can be loaded in multiple accesses.
  • the description information may include at least one of the above information, and may also include other information other than the above information, which will not be repeated here.
  • the description information is used to let the first loading unit know how to obtain the input feature data from the storage unit.
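The description fields listed above can be collected into a simple record for illustration. The grouping into one structure, the helper method, and the example values are assumptions made for this sketch; they are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class IfmDescriptor:
    """Description information for one first loading unit
    (field names follow the text)."""
    ifm_num: int         # total number of groups of input feature data to load
    ifm_start: int       # number (id) of the first group to load
    ifm_num_perb: int    # groups processed simultaneously by one array block
    kernel_h: int        # height of the convolution kernel in the array block
    kernel_h_start: int  # row number of the first row the unit loads
    ifm_baddr: int       # base address of the input feature data in storage
    ifm_w: int           # total number of columns of one group

    def group_addr(self, k, ifm_len):
        """Address of the k-th group (0-based), given the per-group length,
        mirroring 'base address plus ifm_len' from the text."""
        return self.ifm_baddr + k * ifm_len

desc = IfmDescriptor(ifm_num=3, ifm_start=1, ifm_num_perb=3,
                     kernel_h=3, kernel_h_start=0, ifm_baddr=0x1000, ifm_w=224)
print(hex(desc.group_addr(1, ifm_len=0x100)))  # → 0x1100
```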
  • the second parsing subunit is further configured to: generate auxiliary loading information, where the auxiliary loading information is used to determine a loading mode of the input feature data.
  • the auxiliary loading information may include some or all of the following information:
  • the sliding step size (stride_h), in the column direction, of the convolution kernel over the input feature data that the first loading unit is responsible for loading, which is used to indicate how many rows the convolution kernel slides down on the input feature data each time convolution processing is performed: when stride_h is equal to 1, it slides down by 1 row; when stride_h is equal to 2, it slides down by 2 rows; and so on.
  • This parameter allows the first loading unit to know which rows of input feature data need to be loaded into the systolic processing array.
  • the height of the dilated convolution kernel (dilate_h), which is used to indicate the interval of the number of lines of the input feature data convolved with the convolution kernel parameters.
  • for example, the convolution kernel is 3×3;
  • when dilate_h is 1, the interval is 1, that is, adjacent rows (e.g., row 1, row 2, and row 3) of the input feature data are convolved with the 3 rows of data in the convolution kernel parameters, respectively;
  • when dilate_h is 2, the interval is 2, that is, every other row (e.g., row 1, row 3, and row 5) of the input feature data is convolved with the 3 rows of data in the convolution kernel parameters, respectively.
  • the height of the input feature data (ifm_h), used to represent the total number of rows of a set of input feature data.
  • the height of the output feature data (ofm_h), used to represent the total number of rows of a set of output feature data.
  • the number of padding data rows (pad_t) for padding the upper boundary of the input feature data.
  • the number of padding data rows (pad_b) for padding the lower boundary of the input feature data.
  • the number of padding data columns (pad_l) for padding the left boundary of the input feature data.
  • the number of padding data columns (pad_r) for padding the right boundary of the input feature data.
  • the auxiliary loading information may include at least one of the above information, and may also include other information other than the above information, which will not be repeated here.
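The roles of the stride_h and dilate_h parameters above can be made concrete with a small helper that lists which input rows feed each kernel row. This is a sketch; the 0-based indexing and the function name are assumptions.

```python
def kernel_input_rows(out_row, stride_h, dilate_h, kernel_h):
    """Rows of the input feature data convolved with each of the kernel_h
    kernel rows when producing output row out_row (0-based indices)."""
    base = out_row * stride_h
    return [base + k * dilate_h for k in range(kernel_h)]

# 3-row kernel, unit stride: adjacent input rows feed the kernel rows.
print(kernel_input_rows(0, stride_h=1, dilate_h=1, kernel_h=3))  # → [0, 1, 2]
# Dilation 2: every other input row feeds the kernel rows.
print(kernel_input_rows(0, stride_h=1, dilate_h=2, kernel_h=3))  # → [0, 2, 4]
# Stride 2: the window for output row 1 starts two rows further down.
print(kernel_input_rows(1, stride_h=2, dilate_h=1, kernel_h=3))  # → [2, 3, 4]
```

Such a mapping is exactly what lets a first loading unit know which rows of input feature data need to be loaded into the systolic processing array.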
  • the cached input feature data may also be rearranged according to the auxiliary loading information, so that the cache mode of the input feature data matches the loading mode, thereby improving data loading efficiency. For example, input feature data loaded to different line processing units can be cached in different cache addresses of cache subunits through rearrangement, or data required for loading can be filtered from the cached input feature data according to loading needs.
  • the cache subunit includes: a first cache subunit for caching the auxiliary loading information; and a second cache subunit for reading from the first cache subunit the auxiliary loading information, and cache the input feature data returned according to the read instruction according to the auxiliary loading information.
  • the first cache subunit may be a First In First Out (First In First Out, FIFO) queue.
  • the second cache subunit includes: a third cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction;
  • a read-write subunit, configured to read the auxiliary loading information from the first cache subunit, rearrange the input feature data buffered by the third cache subunit according to the auxiliary loading information, and write the rearranged input feature data into a fourth cache subunit; and a fourth cache subunit, configured to buffer the rearranged input feature data so that the loading subunit can load the rearranged input feature data into the systolic processing array.
  • the third buffer subunit may be a FIFO queue
  • the fourth buffer subunit may be a random access memory (Random Access Memory, RAM).
  • the read-write subunit is further configured to: in the process of rearranging the input feature data buffered by the third cache subunit, perform a second filling on the input feature data.
  • the second padding includes padding at least one of an upper boundary and a lower boundary of the input feature data.
  • FIG. 8 shows a schematic structural diagram of the first loading unit according to an embodiment of the present disclosure.
  • the first parsing subunit 802 (denoted as IFM_SBLK_INFO in the figure) is used to receive an instruction, parse the instruction according to the requirements of the systolic processing array for input feature data, and output the description information to the second parsing subunit 804 (denoted as IFM_SBLK_RD in the figure).
  • the second parsing subunit 804 is used for parsing the description information and sending an instruction to read the input feature data to the storage unit; the read-back input feature data is written into the third cache subunit 806 (denoted as IFM_FIFO in the figure). During the parsing process, auxiliary loading information is generated and written to the first cache subunit 814 (denoted as INFO_FIFO in the figure).
  • the read-write subunit (RD_FIFO_WR_RAM) 808 is used to read data from the third cache subunit 806 according to the auxiliary loading information generated by the second parsing subunit 804 and load it into the fourth cache subunit 812 (denoted as IFM_RAM in the figure). This process also completes the upper and lower boundary padding.
  • the loading subunit 810 (denoted as WR_IFM_2MAC in the figure) is used to read data from the fourth cache subunit 812 according to the requirements of the systolic processing array for input feature data and send it to the systolic processing array in turn. This process also completes the left and right boundary padding.
  • the third cache subunit 806 is used for storing the input feature data read out from the storage unit. That is, the third buffering subunit 806 buffers the input feature data sent by the storage unit, and loads the input feature data buffered in the third buffering subunit into the fourth buffering subunit 812 .
  • the width of the third cache sub-unit 806 is consistent with the bit width of the data port of the storage unit, that is, the smallest access unit of the storage unit.
  • the third cache subunit 806 is used for storing the read-out input feature data until it is loaded into the fourth cache subunit 812.
  • the number of the fourth cache sub-units is multiple, and the multiple fourth cache sub-units sequentially cache the rearranged input feature data; the loading sub-unit is used to sequentially The input feature data in each of the plurality of fourth buffer sub-units is loaded into the systolic processing array.
  • a separate fourth cache subunit may be used for each group, while the remaining functional units may be shared.
  • Each fourth cache subunit of the plurality of fourth cache subunits starts to cache the rearranged input feature data when the previous fourth cache subunit has finished caching.
  • each fourth cache subunit in the plurality of fourth cache subunits may further include a plurality of fifth cache subunits.
  • the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and
  • the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  • the number of the fourth cache subunits (that is, the IFM_RAM in the figure) is 3, which are respectively called the cache subunits 902 (denoted as ping_ram in the figure) and the cache subunits 904 (denoted as pong_ram in the figure) and cache subunit 906 (denoted as peng_ram in the figure).
  • the fourth cache subunit stores the input characteristic data according to the requirement of the systolic processing array for the input characteristic data.
  • the three identical groups of IFM_RAM (ping, pong, and peng) can work in a pipeline to speed up the loading of the systolic processing array.
  • the depth of each group of fourth cache subunits is equal to the height h of one array block in the systolic processing array, and each group of fourth cache subunits includes several small RAMs (eg, ping_ram0) of depth h, whose width is the same as the bit
  • width of the input feature data.
  • the present disclosure uses a bit width of 8 bits for illustration, but is applicable to other bit widths as well. For different application scenarios, it is not necessary to instantiate three groups of IFM_RAM. For example, when performance requirements are low, only one group of IFM_RAM may be instantiated; when performance requirements are high, more groups of IFM_RAM may be instantiated, until adding IFM_RAM no longer improves performance.
  • Each first loading unit can be synchronized through a synchronization signal out_sync, which indicates whether the corresponding cache subunit 902, cache subunit 904, or cache subunit 906 in each first loading unit has been loaded with data and is ready to be output to the systolic processing array. Only when the cache subunit 902, cache subunit 904, or cache subunit 906 in every first loading unit has been loaded with data will the selection unit 910 select the corresponding data from among the cache subunits 902, 904, and 906 and output this round of data to the systolic processing array.
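The ping/pong/peng rotation can be sketched as a triple buffer. This models only the rotation order, not the out_sync handshake or RAM timing; apart from the ping_ram/pong_ram/peng_ram names taken from the figure, all names here are assumptions.

```python
class TripleBuffer:
    """Sketch of the ping/pong/peng IFM_RAM rotation: writes fill one
    buffer while reads drain another, so caching and loading overlap."""
    def __init__(self, names=("ping_ram", "pong_ram", "peng_ram")):
        self.bufs = {n: [] for n in names}
        self.order = list(names)
        self.wr = 0   # index of the buffer currently being filled
        self.rd = 0   # index of the buffer currently being drained

    def write_block(self, data):
        name = self.order[self.wr]
        self.bufs[name] = list(data)
        self.wr = (self.wr + 1) % len(self.order)
        return name

    def read_block(self):
        name = self.order[self.rd]
        data, self.bufs[name] = self.bufs[name], []
        self.rd = (self.rd + 1) % len(self.order)
        return name, data

tb = TripleBuffer()
tb.write_block([1, 2])   # fills ping_ram
tb.write_block([3, 4])   # fills pong_ram while ping_ram is drained
print(tb.read_block())   # → ('ping_ram', [1, 2])
```

With three buffers, one can be written while another is read and a third stands ready, which is what allows the pipeline to keep the systolic processing array fed.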
  • each of the plurality of first loading units is capable of sequentially reading a plurality of data blocks of the input feature data from the storage unit, and when at least one data block among the plurality of data blocks has been cached, the at least one data block can be loaded into an array block in the systolic processing array.
  • FIG. 10 it is a processing flow chart of the data processing apparatus according to the embodiment of the present disclosure. It should be noted that this processing flow is only a schematic diagram of a possible processing manner of the data processing apparatus of the present disclosure, and is not used to limit the present disclosure.
  • Step 1002 Receive an instruction, where the instruction can carry at least one of the above-mentioned description information.
  • Step 1004 Parse the instruction to obtain the description information.
  • Step 1006 Read row data according to the minimum access unit of the storage unit. That is, starting from the row corresponding to the convolution kernel height starting position kernel_h_start in the first input feature map ifm_start, access one minimum access unit per row until the kernel_h rows of data corresponding to one convolution kernel have been accessed, then continue to access the kernel_h rows of data of the next group of input feature data corresponding to the next convolution kernel, accessing one minimum access unit per row.
  • kernel_h is the height of the convolution kernel.
  • Step 1008 Following step 1006, complete one row access (one minimum access unit per access) for each of the ifm_num_perb groups of input feature data mapped to the array block at one time.
  • ifm_num_perb is the number of input feature maps corresponding to one mapping of the array block.
  • Step 1010 Execute steps 1006 and 1008 in a loop, accessing one minimum access unit per row, until the accesses cover an entire row; at that point, one sliding step stride_h of the convolution kernel over the kernel_h rows of an input feature map is completed.
  • the convolution kernel parameters need to move a total of ofm_h (the output feature map height) times over the input feature data.
  • Step 1012 Execute steps 1006 to 1010 in a loop to complete the ofm_h moving scans of the convolution kernel parameters over the input feature data. At this point, the loading of all input feature data corresponding to ifm_num_perb is completed.
  • Step 1014 Execute steps 1006 to 1012 cyclically, ifm_num_perb input feature maps at a time, to complete the loading of all input feature maps.
  • Step 1016 End.
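The loop nesting of steps 1006 to 1014 can be paraphrased in Python. The exact loop bounds, the nesting order, and the tuple format are interpretations of the flow above, made only to show its structure; real hardware would issue read instructions rather than build a list.

```python
def load_order(ifm_num, ifm_num_perb, kernel_h, ofm_h, accesses_per_row):
    """Enumerate read accesses in the nested-loop order of steps 1006-1014
    (one tuple per minimum-access-unit read; illustrative only)."""
    accesses = []
    for ifm in range(ifm_num):                  # step 1014: all feature maps
        for move in range(ofm_h):               # step 1012: kernel moves ofm_h times
            for seg in range(accesses_per_row): # step 1010: cover a full row
                for grp in range(ifm_num_perb): # step 1008: groups per array block
                    for row in range(kernel_h): # step 1006: kernel_h rows
                        accesses.append((ifm, move, seg, grp, row))
    return accesses

order = load_order(ifm_num=1, ifm_num_perb=2, kernel_h=3, ofm_h=2,
                   accesses_per_row=1)
print(len(order))   # → 12 (1 * 2 * 1 * 2 * 3 accesses)
```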
  • the number of columns of the systolic processing array is an integer multiple of three. Since the minimum access unit of storage units such as SRAM is 2 to the nth power of bytes, the dimensions of most systolic processing arrays in the industry are also powers of 2.
  • among convolution kernels of various sizes, the 3×3 convolution kernel accounts for a high proportion, and when a 3×3 convolution kernel is mapped to a systolic processing array whose dimension is a power of 2, some processing units cannot be mapped to, resulting in a waste of systolic processing array resources.
  • designing the number of columns of the systolic processing array to be an integer multiple of 3 can effectively reduce resource waste in most cases, thereby improving the processing efficiency of the systolic processing array.
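The utilization argument can be checked numerically; this is an illustrative calculation, not part of the disclosure, and it considers only the column dimension.

```python
def column_utilization(array_cols, kernel_w=3):
    """Fraction of array columns usable when kernel_w-wide convolution
    kernels are mapped side by side onto the systolic processing array."""
    usable = (array_cols // kernel_w) * kernel_w
    return usable / array_cols

print(column_utilization(32))  # → 0.9375 (2 of 32 columns left idle)
print(column_utilization(33))  # → 1.0    (33 is an integer multiple of 3)
```

A power-of-2 width such as 32 always strands two columns under 3-wide kernels, while any multiple of 3 reaches full column utilization.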
  • the size of the systolic processing array is 32×32 (that is, each row and each column of the systolic processing array has 32 processing units), which is also the size of the systolic processing array of common 1K-scale CNN accelerators in the industry; however, for a 3×3 convolution kernel, two rows and two columns of such a systolic processing array can never be used, resulting in a waste of valuable computing resources. As shown in FIG.
  • each small square represents a processing unit 1106, a plurality of first loading units 1102 respectively load input feature maps IFM0 to input feature maps IFM9, and a plurality of second loading units 1104 respectively load weights Weight 0 to weight Weight 9.
  • the plurality of first loading units 1102 transmit their respective corresponding input feature maps to the corresponding plurality of processing units 1106 at different times.
  • the corresponding plurality of processing units are located in one array block. That is, the plurality of first loading units 1102 transmit respective corresponding input feature maps to corresponding array blocks at different times.
  • the first loading unit 1102 loaded with the input feature map IFM0 transmits the input feature map IFM0 to the n rows of processing units 1106 adjacent to that first loading unit 1102, where n is the number of rows.
  • n corresponds to the size of the convolution kernel.
  • n is equal to 3 when the size of the convolution kernel is 3×3. That is, n is equal to the width or length of the convolution kernel.
  • the times at which two adjacent first loading units 1102 transmit their corresponding feature maps to the corresponding processing units 1106 are separated by one clock cycle.
  • the plurality of second loading units 1104 transmit their corresponding weights to the processing unit 1106 at different times.
  • the second loading unit 1104 loaded with the weight Weight 0 transmits the weight Weight 0 to the n rows of processing units 1106 adjacent to that second loading unit 1104, where n is the number of rows.
  • n corresponds to the size of the convolution kernel.
  • n is equal to 3 when the size of the convolution kernel is 3×3; that is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two adjacent second loading units 1104 transmit their corresponding weights to the corresponding processing units 1106 are separated by one clock cycle.
  • every three rows of processing units 1106 in the above-mentioned systolic processing array can load up to 10 groups of convolution kernel parameters.
  • FIG. 11B is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure. It should be noted that the same elements in FIG. 11B as those in FIG. 11A have the same or similar functions as those in FIG. 11A . As shown in FIG. 11B ,
  • each small square in the figure still represents a processing unit 1106. Each group of 3×3 processing units 1106 in the above-mentioned systolic processing array can be used to load one 3×3 convolution kernel, one convolution kernel parameter kij can be loaded into each processing unit 1106, and no processing unit 1106 is idle. Therefore, ideally, the utilization of the systolic processing array is 100%.
  • the above-mentioned data processing apparatus may be a processing chip, for example, an SoC chip, or may be a computer device.
  • FIG. 12 shows a schematic diagram of the hardware structure of a more specific data processing apparatus provided by an embodiment of this specification.
  • the apparatus may include: a processor 1202 , a memory 1204 , an input/output interface 1206 , a communication interface 1208 and a bus 1210 .
  • the processor 1202 , the memory 1204 , the input/output interface 1206 and the communication interface 1208 realize the communication connection among each other within the device through the bus 1210 .
  • the processor 1202 can be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes relevant programs to implement the technical solutions provided by the embodiments of this specification.
  • the memory 1204 can be implemented in the form of ROM (Read Only Memory, read-only memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, and the like.
  • the memory 1204 can store the operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, the relevant program codes are stored in the memory 1204 and invoked by the processor 1202 for execution.
  • the input/output interface 1206 is used for connecting input/output modules to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1208 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices.
  • the communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
  • the bus 1210 includes a path that transfers information between the various components of the device (eg, the processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208).
  • the above-mentioned device only shows the processor 1202, the memory 1204, the input/output interface 1206, the communication interface 1208 and the bus 1210; in the specific implementation process, the device may also include other components necessary for normal operation.
  • the above-mentioned device may only include components necessary to implement the solutions of the embodiments of the present specification, and does not necessarily include all the components shown in the figures.
  • an embodiment of the present disclosure further provides a data processing system 1400, which may include the data processing apparatus (eg, data processing apparatus 1406) of any of the above-mentioned embodiments, and a systolic processing array 1408 for loading the
  • input feature data and convolution kernel parameters provided by the data processing apparatus 1406 and performing convolution processing on the input feature data and the convolution kernel parameters.
  • system further includes: a second loading unit 1404 for loading the convolution kernel parameters into the systolic processing array.
  • the system further includes: a storage unit for storing the input feature data.
  • the storage unit 1302 includes a plurality of mutually independent storage subunits, each storage subunit being used to store part of the input feature data; the plurality of
  • first loading units are configured to access different storage subunits at the same time to acquire the input feature data stored in the corresponding storage subunits.
  • the storage unit 1302 further includes a scheduling unit 1304, configured to receive the access requests of the multiple first loading units and send the access requests to the corresponding storage subunits 1306, so that the multiple first loading units access the corresponding storage subunits 1306.
  • the system further includes: an output unit, configured to acquire the processing result output by the systolic processing array, and store the processing result, or output the processing result.
  • the system may be implemented based on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • an embodiment of the present disclosure further provides a neural network accelerator 1500, characterized in that, the neural network accelerator includes the data processing apparatus 1502 described in any embodiment of the present disclosure, or includes the present disclosure The data processing system 1504 of any embodiment.
  • the neural network accelerator is a CNN accelerator or a Recurrent Neural Network (RNN) accelerator.
  • the present disclosure also provides a data processing method, applied to each first loading unit in a data processing apparatus that includes a plurality of first loading units, to load input feature data into a systolic processing array. The method includes:
  • Step 1602: Access a storage unit in parallel through each of the plurality of first loading units, and read input feature data from the storage unit;
  • Step 1604: Cache the read input feature data;
  • Step 1606: Load the cached input feature data into at least one row of processing units in the systolic processing array.
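As a rough software analogue of steps 1602-1606 (not part of the patent; the class, method, and variable names below are all hypothetical, and the parallelism across loading units is abstracted away), each first loading unit performs a read-cache-load sequence:

```python
from collections import deque

# Hypothetical sketch of steps 1602-1606 for one first loading unit.
# The storage unit is modeled as a plain list and the systolic array
# rows as a dictionary; none of these names come from the patent.
class FirstLoadingUnit:
    def __init__(self, row_ids):
        self.cache = deque()      # step 1604: local cache
        self.row_ids = row_ids    # rows of the array this unit feeds

    def read(self, storage, start, count):
        # step 1602: read a slice of the input feature data
        self.cache.extend(storage[start:start + count])

    def load(self, array_rows):
        # step 1606: drain the cache into the assigned array rows
        for r in self.row_ids:
            if self.cache:
                array_rows.setdefault(r, []).append(self.cache.popleft())

storage = list(range(8))
rows = {}
unit = FirstLoadingUnit(row_ids=[0, 1, 2])
unit.read(storage, 0, 3)
unit.load(rows)
assert rows == {0: [0], 1: [1], 2: [2]}
```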
  • the sum of the rates at which the plurality of first loading units buffer valid data in the input feature data is greater than or equal to the loading rate of the systolic processing array.
  • the sum of the buffering rates of the input feature data in each clock cycle is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average buffering rates of the input feature data is greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data is greater than or equal to the maximum loading rate of the systolic processing array.
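As a numerical illustration of the rate condition above (all numbers are assumed example values, not taken from the patent):

```python
# Illustrative check of the stall-free condition: the summed buffering
# rates of the first loading units must cover the systolic array's
# loading rate, otherwise processing units go idle.
def stall_free(buffer_rates, load_rate):
    return sum(buffer_rates) >= load_rate

# Four units each buffering 8 valid values per cycle can feed an array
# consuming 32 values per cycle; three such units cannot.
assert stall_free([8, 8, 8, 8], 32)
assert not stall_free([8, 8, 8], 32)
```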
  • the method further includes: in the process of loading the input feature data, performing a first filling on the input feature data.
  • the performing the first padding on the input feature data includes: padding at least one of a left border and a right border of the input feature data.
  • accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: reading, from the storage unit each time, one data block of the input feature data; and loading the buffered input feature data into at least one row of processing units in the systolic processing array includes: when the data block has been cached, loading the data block into at least one row of processing units in the systolic processing array.
  • the number of columns or rows of the data block is equal to the number of data corresponding to the minimum access unit of the storage unit.
  • each of the plurality of first loading units includes a sending subunit, a cache subunit, and a loading subunit; accessing the storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: sending a read instruction to the storage unit through the sending subunit; caching, through the cache subunit, the input feature data returned in response to the read instruction; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit.
  • the height of the cache subunit is equal to the height of the systolic processing array.
  • each cache subunit corresponds to one array block, and the height of the cache subunit is equal to the height of the corresponding array block.
  • the number of cache subunits is multiple; caching, through the cache subunit, the input feature data returned in response to the read instruction includes: sequentially caching, through the multiple cache subunits, the input feature data returned in response to the read instruction; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit includes: sequentially loading the input feature data in each of the multiple cache subunits into the at least one row of processing units of the systolic processing array.
  • sequentially caching, through the multiple cache subunits, the input feature data returned in response to the read instruction includes: when the caching of the input feature data corresponding to the previous cache subunit among the multiple cache subunits is completed, starting to cache the input feature data corresponding to the current cache subunit among the multiple cache subunits.
  • the cache subunit includes a plurality of cache blocks; caching the input feature data returned in response to the read instruction through the cache subunit includes: caching, through each cache block of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit includes: loading, through the loading subunit, the input feature data cached by each cache block of the plurality of cache blocks into a corresponding row of processing units in the systolic processing array; wherein, after the input feature data cached in the v-th cache block has been loaded, a first load instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, where v is an integer greater than 1.
  • accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: after the input feature data corresponding to each first loading unit among the multiple first loading units has been loaded, sending a second loading instruction to the next first loading unit, to trigger the next first loading unit to load its input feature data into the systolic processing array.
  • the sending subunit includes a first parsing subunit and a second parsing subunit; sending a read instruction to the storage unit through the sending subunit includes: receiving a loading instruction through the first parsing subunit, and parsing the loading instruction to generate description information of the input feature data to be loaded; and parsing the description information through the second parsing subunit, and sending the read instruction to the storage unit according to the parsing result.
  • the description information includes at least one of the following: the number of groups of input feature data to be loaded, the group number of the first group of input feature data to be loaded, the number of groups of input feature data that can be processed simultaneously by the corresponding array block, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  • the cache subunit includes a first cache subunit and a second cache subunit; caching the input feature data returned in response to the read instruction through the cache subunit includes: generating auxiliary loading information, which determines the loading mode of the input feature data to be loaded; caching the auxiliary loading information through the first cache subunit; and, through the second cache subunit, reading the auxiliary loading information from the first cache subunit and caching, according to the auxiliary loading information, the input feature data returned in response to the read instruction.
  • reading the auxiliary loading information from the first cache subunit through the second cache subunit and caching the input feature data according to the auxiliary loading information further includes: caching, through a third cache subunit, the input feature data returned by the storage unit in response to the read instruction; reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging the input feature data cached by the third cache subunit according to the auxiliary loading information, and writing the rearranged input feature data into a fourth cache subunit; and caching the rearranged input feature data through the fourth cache subunit, so that the loading subunit loads the rearranged input feature data into the systolic processing array.
  • the number of fourth cache subunits is multiple; caching the rearranged input feature data through the fourth cache subunit includes: sequentially caching the rearranged input feature data through the multiple fourth cache subunits; and loading the rearranged input feature data into the systolic processing array through the loading subunit includes: sequentially loading, through the loading subunit, the input feature data in each of the multiple fourth cache subunits into the systolic processing array.
  • when the caching of the rearranged input feature data corresponding to the previous fourth cache subunit among the multiple fourth cache subunits is completed, caching of the rearranged input feature data corresponding to the current fourth cache subunit among the multiple fourth cache subunits begins.
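A minimal software sketch of this sequential double-buffering (assuming, purely for illustration, exactly two fourth cache subunits used alternately):

```python
# Hypothetical ping-pong schedule for two "fourth cache subunits": at
# each step one subunit is filled with rearranged data while the
# previously filled one is drained into the systolic array.
def ping_pong_schedule(num_chunks):
    schedule = []
    for step in range(num_chunks):
        fill = step % 2                               # subunit cached now
        drain = (step - 1) % 2 if step > 0 else None  # subunit loaded now
        schedule.append((fill, drain))
    return schedule

assert ping_pong_schedule(3) == [(0, None), (1, 0), (0, 1)]
```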
  • each fourth cache subunit of the multiple fourth cache subunits includes a plurality of fifth cache subunits; among the fifth cache subunits included in a fourth cache subunit, the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  • the method further includes: in the process of rearranging the input feature data cached by the third cache subunit, performing second padding on the input feature data cached by the third cache subunit.
  • performing the second padding on the input feature data cached by the third cache subunit includes: padding at least one of the upper boundary and the lower boundary of the input feature data cached by the third cache subunit.
  • the auxiliary loading information includes at least one of the following: a sliding step size of a convolution kernel parameter in the column direction of the input feature data, the height of a dilated (atrous) convolution kernel, and a padding parameter.
  • the systolic processing array includes a plurality of array blocks, each of which is used to process a group of input feature data; loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading the input feature data into at least one array block through each of the plurality of first loading units, respectively.
  • each first loading unit of the plurality of first loading units includes a cache subunit, and the depth of the cache subunit is equal to the depth of the array block loaded by that first loading unit; caching the read input feature data includes: caching, through the cache subunit, the input feature data read from the storage unit.
  • the heights of the array blocks are all equal.
  • the size of an array block is equal to the size of a convolution kernel parameter loaded in the systolic processing array.
  • the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  • the size of the systolic processing array is an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array.
  • the number of columns of the systolic processing array is an integer multiple of three.
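A small arithmetic illustration of these sizing constraints (the 3×3 kernel and the 12×12 array are assumed example values, not taken from the patent):

```python
# With 3x3 convolution kernel parameters and a systolic array whose size
# is an integer multiple of the kernel size, a 12x12 array partitions
# into (12 // 3) * (12 // 3) = 16 kernel-sized array blocks, and its
# column count is an integer multiple of three.
kernel_h, kernel_w = 3, 3
array_h, array_w = 12, 12
assert array_w % 3 == 0
num_blocks = (array_h // kernel_h) * (array_w // kernel_w)
assert num_blocks == 16
```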
  • the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which they load data into the (u+1)-th row of processing units, where u is a positive integer.
  • the specific embodiment of the first loading unit in the above method embodiment is the same as the embodiment of the first loading unit 602 in the foregoing data processing apparatus, and details are not described herein again.
  • embodiments of the present disclosure further provide a data processing apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in any of the embodiments.
  • the above-mentioned data processing apparatus may be a data processing chip, for example, a system-on-chip (SoC).
  • the embodiments of the present specification further provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in any of the foregoing embodiments.
  • computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
  • a typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet, wearable device, or a combination of any of these devices.


Abstract

Embodiments of the present disclosure provide a data processing method, apparatus, and system, and a neural network accelerator, used for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first load cells, each of the plurality of first load cells configured to: access a memory cell in parallel to read the input feature data from the memory cell; cache the read input feature data; and load the cached input feature data into at least one row of processing cells in the systolic processing array.

Description

Data Processing Apparatus, Method and System, and Neural Network Accelerator

Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular, to a data processing apparatus, method and system, and a neural network accelerator.
Background
Some neural networks contain many convolution operations. When performing convolution processing, the input feature data and convolution kernel parameters need to be loaded into the systolic processing array in the neural network, and the systolic processing array then computes on the input feature data and the convolution kernel parameters to obtain the output feature data. With the traditional data loading method, the processing units in the systolic processing array are often idle, so that the processing efficiency of the systolic processing array is low.
SUMMARY OF THE INVENTION
In view of this, the embodiments of the present disclosure propose a data processing method, apparatus and system, and a neural network accelerator, to solve the technical problem of low processing efficiency of a systolic processing array in the related art.
According to a first aspect of the embodiments of the present disclosure, a data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus includes a plurality of first loading units, each of which is configured to: access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
According to a second aspect of the embodiments of the present disclosure, a data processing system is provided, including the data processing apparatus described in any one of the embodiments, and a systolic processing array for loading the input feature data and the convolution kernel parameters and performing convolution processing on the input feature data and the convolution kernel parameters.
According to a third aspect of the embodiments of the present disclosure, a neural network accelerator is provided, which includes the data processing apparatus described in any embodiment, or the data processing system described in any embodiment.
According to a fourth aspect of the embodiments of the present disclosure, a data processing method is provided, applied to each first loading unit in a data processing apparatus including a plurality of first loading units, to load input feature data into a systolic processing array. The method includes: accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit; caching the read input feature data; and loading the cached input feature data into at least one row of processing units in the systolic processing array.
According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method described in any embodiment of the present disclosure is implemented.
According to a sixth aspect of the embodiments of the present disclosure, a data processing apparatus is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any first loading unit in the method of any embodiment of the present disclosure.
By applying the solutions of the embodiments of the present disclosure, multiple first loading units acquire input feature data from the storage unit in parallel. Compared with acquiring data through only one loading unit, the amount of valid data acquired by the systolic processing array is multiplied, which reduces the idling of the processing units in the systolic processing array and improves the processing efficiency of the systolic processing array.
Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the processing manner of a systolic processing array according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram of a data loading manner according to an embodiment of the present disclosure.

FIG. 3 is a schematic diagram of the data flow process in a systolic processing array according to an embodiment of the present disclosure.

FIG. 4 is a schematic diagram of a data reading and loading process according to an embodiment of the present disclosure.

FIGS. 5A and 5B are schematic diagrams of valid data read from a storage unit according to an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.

FIGS. 7A and 7B are schematic diagrams of changes in the amount of cached data during a data loading process according to an embodiment of the present disclosure.

FIG. 8 is a schematic structural diagram of a first loading unit according to an embodiment of the present disclosure.

FIG. 9 is a schematic diagram of instantiating multiple cache units according to an embodiment of the present disclosure.

FIG. 10 is a processing flowchart of a data processing apparatus according to an embodiment of the present disclosure.

FIG. 11A is a schematic diagram of the length of a conventional systolic processing array.

FIG. 11B is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure.

FIG. 12 is a schematic diagram of a computer device according to an embodiment of the present disclosure.

FIG. 13 is a schematic diagram of a storage unit according to an embodiment of the present disclosure.

FIG. 14 is a schematic diagram of a data processing system according to an embodiment of the present disclosure.

FIGS. 15A and 15B are schematic diagrams of neural network accelerators according to embodiments of the present disclosure.

FIG. 16 is a flowchart of a data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "the," and "said" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein can be interpreted as "at the time of," "when," or "in response to determining."
According to an embodiment of the present disclosure, a data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus includes a plurality of first loading units, each of which can access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
Neural networks such as convolutional neural networks (CNNs) often include many convolution operations, that is, input feature data is convolved with convolution kernel parameters (also called weight data) to obtain output feature data. The input feature data here can be an input feature map (IFM), which can come from images, speech or text; correspondingly, the output feature data is an output feature map (OFM), which can be converted into an image, speech or text. When performing convolution processing, the input feature data and the convolution kernel parameters need to be loaded into the systolic processing array in the neural network, and the systolic processing array then performs convolution processing on them to obtain the output feature data. A systolic processing array is a simple and efficient processing device; for a weight-stationary systolic processing array, the input feature data is pipelined and reused within the array, which reduces the input bandwidth requirement for the input feature data.
The working principle of a systolic processing array with fixed convolution kernel parameters is shown in FIG. 1. A common systolic processing array is rectangular, including R×H processing units, that is, each row has H processing units and each column has R processing units; R and H may or may not be equal. The larger the size of the systolic processing array, the greater the bandwidth requirement for the input feature data. The processing unit in the i-th row and j-th column is used to pass the input feature data obtained in the current clock cycle to the processing unit in the i-th row and (j+1)-th column in the next clock cycle, and to send its operation result to the processing unit in the (i+1)-th row and j-th column, so that in the next clock cycle the latter can accumulate the operation result of the processing unit in the i-th row and j-th column with its own operation result. It should be noted that a processing unit does not necessarily process every piece of input feature data it receives. For example, after the processing unit in row 1, column 1 passes the input feature data of row 1, column 1 to the processing unit in row 1, column 2, the latter does not multiply that input feature data with the convolution kernel parameter it holds, but passes it directly to the processing unit in row 1, column 3. Each processing unit can multiply and accumulate the input feature data loaded into it with its convolution kernel parameter and output the result to the next processing unit; the operation results of the last row of processing units serve as the output feature data. One processing unit can be loaded with one input feature datum and one convolution kernel parameter; alternatively, a processing unit can be loaded with an input feature data block of size M×N and a convolution kernel parameter block of size K×L. R, H, M, N, K and L are all positive integers.
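The multiply-accumulate behavior described above can be sketched in a much-simplified software model (per-cycle timing and the data skew are abstracted away; this illustrates what a weight-stationary column of processing units computes, not how the patented hardware schedules it):

```python
# Simplified model of a weight-stationary systolic column performing a
# 1-D convolution: each processing unit holds one kernel weight, and the
# partial sum handed down from the unit above is accumulated with the
# local product. Cycle-accurate timing is deliberately omitted.
def systolic_column_conv(inputs, weights):
    n_out = len(inputs) - len(weights) + 1
    outputs = []
    for o in range(n_out):
        psum = 0                       # partial sum entering the top unit
        for r, w in enumerate(weights):
            psum += w * inputs[o + r]  # unit r adds its product
        outputs.append(psum)           # result leaves the bottom unit
    return outputs

# Matches a direct sliding-window convolution.
assert systolic_column_conv([1, 2, 3, 4, 5], [1, 0, -1]) == [-2, -2, -2]
```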
The first loading unit is used to load the input feature data into the corresponding processing units. The input feature data can be stored in a storage unit, which can be a static random-access memory (SRAM) or another type of storage unit. Since the processing result of a processing unit is passed downward and accumulated, in the processing unit below, with that unit's own product, the input feature data of each row of processing units is delayed by one clock cycle per row; in this way, when a row's product of input feature data and convolution kernel parameter has been computed, it is accumulated exactly with the partial sum passed down from the previous row of processing units. If the input of the input feature data stops, the computation of at least one processing unit in the systolic processing array at the corresponding moment also stops.
The second loading unit is configured to load convolution kernel parameters into the corresponding processing units. The convolution kernel parameters may be stored in a storage unit; the storage unit storing the convolution kernel parameters and the storage unit storing the input feature data may be the same storage unit or different storage units. During a given computation cycle, the convolution kernel parameters in a processing unit remain unchanged and are reused for the different input feature data flowing into that processing unit. The loading order of the convolution kernel parameters and the input feature data is not restricted here: the convolution kernel parameters may be loaded before the input feature data, after it, or at the same time. In addition, different processing units may load their convolution kernel parameters simultaneously, or sequentially in a certain order, for example, in the same manner as the input feature data is loaded into the systolic processing array. The output unit is configured to buffer the output data of the processing units, or to send the processing results (i.e., the output feature data) to other processing units for further processing or storage.
Taking the loading of input feature maps and convolution kernel parameters as an example, the first loading unit may load one input feature map into the systolic processing array at a time; to improve processing efficiency, it may also load multiple input feature maps at a time. Similarly, the second loading unit may load one set of convolution kernel parameters into the systolic processing array at a time, or multiple sets at a time. As shown in FIG. 2, assuming that each set of convolution kernel parameters has a size of 3×3 and the processing unit in row i, column j is denoted Mij, the second loading unit can load four sets of convolution kernel parameters into the systolic processing array (squares of different colors in the figure represent different convolution kernel parameters). Assuming that each set of convolution kernels corresponds to three independent rows of data (row_x, row_y, row_z) in an input feature map, the first loading unit can feed (row_x, row_y, row_z) of the first input feature map into the first three rows of processing units of the systolic processing array, and (row_x, row_y, row_z) of the second input feature map into the second three rows of processing units, where the first three rows precede the second three rows in processing order. In this way, two input feature maps can be processed simultaneously.
Assume the height of the systolic processing array is H, i.e., the array has H rows; then at most H new data values flow into the array per clock cycle. In other words, when the systolic processing array processes input feature data at full speed, its maximum demand rate is H values per clock cycle, i.e., in every clock cycle one input feature datum can be fed to each row of the array. For example, denote the input feature datum in row i, column j as aij, and let H equal 5. As shown in FIG. 3, in the 1st clock cycle, a11 is input into M11. In the 2nd clock cycle, a11 is passed rightward from M11 to M12, the new datum a12 is input into M11, and a21 is input into M21. In the 3rd clock cycle, a11 is passed rightward to M13, a12 is passed rightward to M12, the new datum a13 is input into M11, a21 is passed rightward to M22, the new datum a22 is input into M21, and the new datum a31 is input into M31. The data in the gray boxes in the figure are the new data flowing into the systolic processing array in each clock cycle. By analogy, from the 5th clock cycle onward, 5 new values are input into the first column of processing units in every cycle, so the maximum data demand rate of the systolic processing array is 5.
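The staggered timing just described can be sketched as follows (an illustrative model, not from the patent): at clock cycle t, row i of the array receives element a[i][t−i+1], so each row starts one cycle after the row above it, and from cycle H onward H new values enter per cycle.

```python
# Illustrative sketch of the staggered input schedule: which (row, column)
# indices of the input feature data enter column 1 of the array at cycle t.
def new_inputs(t, H):
    """1-based (row, column) indices of data entering the array at cycle t."""
    return [(i, t - i + 1) for i in range(1, min(t, H) + 1)]

H = 5
assert new_inputs(1, H) == [(1, 1)]                   # cycle 1: a11 into M11
assert new_inputs(3, H) == [(1, 3), (2, 2), (3, 1)]   # cycle 3: a13, a22, a31
assert len(new_inputs(5, H)) == 5                     # full rate from cycle H on
```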
As shown in FIG. 4, input feature data (e.g., an input feature map) may be stored in a storage unit; the first loading unit reads the input feature data from a data output port of the storage unit and loads it into the systolic processing array. For example, the portion of the input feature map read from one data output port in each clock cycle is loaded into one row of the systolic processing array. For some storage units, for example on-chip SRAM, the minimum access unit is generally 2 to the n-th power bytes, where n is an integer; that is, at least 2^n bytes of data are accessed each time. Assume that the bit width of the SRAM's data output port equals its minimum access unit, i.e., input feature data can be accessed in the SRAM in units of the SRAM's minimum access unit. For ease of description, the technical solutions of the embodiments of the present disclosure are described below taking each datum to be 1 byte (i.e., 8 bits) long; in practical applications, a datum may also be longer than 1 byte, and the situation for other data lengths is similar to that for 1-byte data, so this disclosure does not elaborate on it further. Now consider the following cases:
Case 1: when the number of bits corresponding to the column count of the input feature map stored in the SRAM is greater than the amount of data in the SRAM's minimum access unit and is not an integer multiple of that amount, the amount of data the data processing apparatus can access from the SRAM in one clock cycle is less than the amount of data in one row of the stored input feature map, and after one access operation on a row of the stored input feature map, the access operation for the data at the end of that row becomes less efficient. As shown in FIG. 5A, assume that a row of input feature data has 40 columns (i.e., 40 bytes per row), each input feature datum in the row is 1 byte, and the minimum access unit is 32 bytes. Then accessing the SRAM in the 1st clock cycle retrieves 32 bytes of the input feature data; when accessing the SRAM in the 2nd clock cycle, only the last 8 bytes of the row remain, so although 32 bytes can still be accessed, only 8 of them are valid data.
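The arithmetic behind Case 1 can be sketched as follows (an illustrative helper, not part of the patent): reading one row in minimum-access-unit chunks, the final access carries only the row's leftover bytes as valid payload.

```python
# Illustrative sketch of Case 1: valid bytes returned by each successive
# access when reading one row whose length is not a multiple of the
# minimum access unit.
def valid_bytes_per_access(row_bytes, access_unit):
    """Valid payload of each access needed to read one row."""
    out, remaining = [], row_bytes
    while remaining > 0:
        out.append(min(remaining, access_unit))   # last chunk may be partial
        remaining -= access_unit
    return out

assert valid_bytes_per_access(40, 32) == [32, 8]    # FIG. 5A example
assert valid_bytes_per_access(64, 32) == [32, 32]   # fully efficient multiple
```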
Case 2: the column count of the input feature map stored in the SRAM is relatively small, so one minimum access unit may span multiple rows of input feature data. When only one row of input feature data needs to be accessed from the SRAM, multiple rows may be retrieved in a single access operation, but only one of them is valid data. As shown in FIG. 5B, the white rectangles represent rows 1, 3 and 5 of the input feature data, and the gray rectangles represent rows 2, 4 and 6. Assume that a row of input feature data has 32 columns, each input feature datum in the row is 1 byte, and the minimum access unit is 64 bytes; then 64 bytes can be retrieved from the SRAM per clock cycle (for example, row 1 and row 2 together). However, since one row of input feature data is loaded into the systolic processing array at a time, only row 1 or row 2 is valid data, so the valid data amounts to 32 bytes.
Assume that the amount of data corresponding to the height H of the systolic processing array exactly equals the amount of data in the minimum access unit. Then in the first case, only the data provided in odd-numbered clock cycles can satisfy the array's maximum data consumption rate; the data provided in even-numbered clock cycles falls short of it, so the systolic processing array cannot obtain enough valid data in any even-numbered cycle. In the second case, the data provided in every clock cycle falls short of the array's maximum data consumption rate, i.e., the systolic processing array can never obtain enough valid data in a cycle. Furthermore, when the height H of the systolic processing array is large (greater than the minimum access unit), this phenomenon is even more pronounced.
It can be seen that in both of the above cases, because the amount of valid data obtained per clock cycle is reduced, H data values cannot be loaded into the systolic processing array every clock cycle, leaving some processing units in the array idle (no data flowing in) and lowering processing efficiency. Since the systolic processing array occupies most of the area of a neural network accelerator chip, a drop in the array's efficiency means inefficiency of the whole chip and directly reduces the chip's cost-effectiveness. Traditional data loading schemes often use a single systolic processing array to process input feature data of various sizes; as a result, when the array's height does not match the SRAM's data output port (i.e., the amount of data corresponding to the array's height differs from the amount of data output by the SRAM's data output port), there is no guarantee that the amount of input feature data supplied to the array per clock cycle comes close to the array height H.
Based on this, an embodiment of the present disclosure provides a data processing apparatus for loading input feature data into a systolic processing array. As shown in FIG. 6, the apparatus may include:
多个第一装载单元602,所述多个第一装载单元602中每个第一装载单元用于:A plurality of first loading units 602, each first loading unit of the plurality of first loading units 602 is used for:
并行地访问存储单元,以从所述存储单元读取输入特征数据,accessing memory cells in parallel to read input feature data from the memory cells,
对读取的所述输入特征数据进行缓存,以及buffering the read input feature data, and
将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元。The buffered input feature data is loaded into at least one row of processing elements in a systolic processing array.
In one embodiment, the size of the systolic processing array may be determined according to the size of the convolution kernel parameters to be loaded into it, i.e., an array of suitable size is selected based on the size of the convolution kernel parameters. To improve the utilization of processing units in the systolic processing array, the array size may be an integer multiple of the size of one set of convolution kernel parameters loaded into it. For example, if the convolution kernel parameters loaded into the systolic processing array are all K×L, the number of rows of the array is an integer multiple of K and the number of columns is an integer multiple of L.
In the embodiments of the present disclosure, the systolic processing array can be divided along the "row" dimension into several array blocks, each array block including one or more rows of processing units, and the array blocks may all have the same height. Optionally, the size of one array block equals the size of one set of convolution kernel parameters loaded into the systolic processing array; that is, one array block holds one set of convolution kernel parameters. Alternatively, the size of one array block may equal the combined size of multiple sets of convolution kernel parameters loaded into the array. For example, in FIG. 6, every 3 rows of processing units in the systolic processing array form one array block, so the array is divided into two array blocks, array block 1 and array block 2, where each small square represents a processing unit.
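The row-wise partitioning above can be sketched as follows (an illustrative helper, not from the patent): an array of H rows is split into blocks of a fixed number of rows, e.g. the kernel height K, so that each block can be served by its own first loading unit.

```python
# Illustrative sketch of partitioning an H-row systolic array into array
# blocks of `block_rows` rows each (0-based row indices).
def partition_rows(H, block_rows):
    assert H % block_rows == 0, "array height must be a multiple of block height"
    return [list(range(start, start + block_rows))
            for start in range(0, H, block_rows)]

# The FIG. 6 example: 6 rows split into two 3-row array blocks.
assert partition_rows(6, 3) == [[0, 1, 2], [3, 4, 5]]
```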
Each first loading unit can load input feature data into one array block of the systolic processing array, i.e., each array block independently uses one first loading unit. In one embodiment, the size of the input feature data loaded into an array block by the first loading unit in each clock cycle equals the amount of data corresponding to the height of one array block. Each first loading unit of the plurality of first loading units 602 may be connected to a first interface, which may be any of various types of interfaces, for example, an application programming interface (API), a remote procedure call (RPC) interface, or a remote method invocation (RMI) interface. Each first loading unit of the plurality of first loading units 602 can access, in parallel through its own first interface, the storage unit that stores the input feature data. The storage unit may be an SRAM or another type of storage device.
Optionally, the storage unit may include a plurality of data output interfaces, and the first interface of each first loading unit communicates with one data output interface of the storage unit to obtain input feature data from the corresponding data output interface. To facilitate communication, the data output interface may be of the same type as the first interface. Optionally, the storage unit may instead include only one data output interface, with the first interface of every first loading unit communicating with that same data output interface to obtain input feature data from it.
The moment at which the plurality of first loading units 602 load input feature data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the moment at which the plurality of first loading units 602 load input feature data into the (u+1)-th row, where u is a positive integer. In one embodiment, the moment at which one first loading unit of the plurality of first loading units 602 loads input feature data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the moment at which another first loading unit loads input feature data into the (u+1)-th row, where u is a positive integer.
The plurality of first loading units 602 can communicate with one another so that the output timing of the input feature data is kept on a diagonal: each row of the systolic processing array receives its input feature data one clock cycle later than the previous row, so that when the product of the current row's input feature data and its convolution kernel parameters has been computed, it is accumulated exactly with the product passed down from the previous row. In some embodiments, each first loading unit of the plurality of first loading units 602 may be provided with a second interface, through which adjacent first loading units communicate. A preceding first loading unit (for example, the k-th first loading unit) may, upon completing its own loading of data into the systolic processing array, send a synchronization signal through the second interface to the next first loading unit (for example, the (k+1)-th first loading unit), and the next first loading unit starts loading data into the systolic processing array after receiving the synchronization signal. The type of the second interface may be the same as or different from that of the first interface.
In some embodiments, the sum of the buffering rates of the input feature data is greater than or equal to the loading rate of the systolic processing array. For example, the sum of the buffering rates of the input feature data in each clock cycle may be greater than or equal to the array's loading rate in that clock cycle; or the sum of the average buffering rates of the input feature data may be greater than or equal to the array's average loading rate; or the sum of the average buffering rates may be greater than or equal to the array's maximum loading rate. Here, the average buffering rate is the ratio of the total amount of input feature data buffered in all clock cycles up to and including the current one to the number of those clock cycles, and the average loading rate is the ratio of the total amount of input feature data loaded in all clock cycles up to and including the current one to the number of those clock cycles. For example, if the current clock cycle is the 2nd, the input feature data buffered in the 1st clock cycle is 4 bytes, and the input feature data buffered in the 2nd clock cycle is 2 bytes, the average buffering rate is (4+2)/2 = 3 bytes per clock cycle. Since the rate of data buffering is at least the rate of data consumption, the systolic processing array is guaranteed always to obtain enough data. In other embodiments, each first loading unit loads input feature data into the systolic processing array only after its own buffering is complete. In either way, the first loading units act as a data pool that aggregates the input feature data and then outputs it steadily to the systolic processing array.
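The averages defined above can be written out as a small helper (illustrative only): total bytes over all cycles so far, divided by the number of cycles.

```python
# Illustrative sketch of the average buffering/loading rate definition:
# sum of per-cycle byte counts divided by the number of clock cycles.
def average_rate(per_cycle_bytes):
    return sum(per_cycle_bytes) / len(per_cycle_bytes)

# The worked example above: 4 bytes in cycle 1, 2 bytes in cycle 2.
assert average_rate([4, 2]) == 3.0   # bytes per clock cycle
```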
Each data output interface of the storage unit can, within one clock cycle, output to the first interface of a first loading unit some or all of the input feature data located in the same row. For example, data output interface 1 of the storage unit outputs the input feature data of row 1 to first interface 1 of a first loading unit in the first clock cycle, data output interface 2 of the storage unit outputs the input feature data of row 2 to first interface 2 in the first clock cycle, and so on.
Each first loading unit may include at least one cache subunit for buffering the input feature data obtained from the storage unit. When a first loading unit includes multiple cache subunits, each of the cache subunits may, in turn and in a certain order, obtain input feature data from the storage unit, buffer it, and load the buffered input feature data into at least one row of processing units of the systolic processing array. Each cache subunit can begin buffering its input feature data once the previous cache subunit has finished buffering, and the loading of the preceding cache subunit in the loading order proceeds in parallel with the buffering of the following one. For example, when there are 2 cache subunits: in the first round, data is buffered in cache subunit 1; once cache subunit 1 is full, its data is loaded into the systolic processing array while data is buffered in cache subunit 2; once cache subunit 2 is full, its data is loaded into the array while cache subunit 1 buffers again, and so on in alternation. When the number of cache subunits is greater than 2, buffering and loading proceed similarly and are not described again here.
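The alternating (ping-pong) scheme above can be sketched as a schedule (an illustrative model, not from the patent): at each step, one block of data is being buffered while the previously buffered block is being loaded into the array.

```python
# Illustrative sketch of the two-subunit ping-pong schedule: per step,
# which data block is being buffered (fill) and which is being loaded (drain).
def ping_pong_schedule(num_blocks):
    """Return (fill, drain) block indices per step; None means idle."""
    steps = []
    for t in range(num_blocks + 1):
        fill = t if t < num_blocks else None     # block being buffered now
        drain = t - 1 if t >= 1 else None        # previously buffered block
        steps.append((fill, drain))
    return steps

# Two blocks: buffer 0; then buffer 1 while loading 0; then load 1.
assert ping_pong_schedule(2) == [(0, None), (1, 0), (None, 1)]
```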
A cache subunit may include a plurality of cache blocks, each of which is used to buffer the input feature data required by one row of processing units in the systolic processing array, and a loading subunit is configured to load the input feature data buffered in each of the cache blocks into the corresponding row of processing units in the array. After the input feature data buffered in the v-th cache block has been loaded, a first loading instruction can be sent to the (v+1)-th cache block so that the loading subunit begins loading the input feature data buffered in the (v+1)-th cache block, where v is an integer greater than 1.
After a first loading unit finishes buffering input feature data into the corresponding cache subunit, it can load the input feature data buffered in that cache subunit into the systolic processing array. During loading of the input feature data, each first loading unit reads one input feature datum from its corresponding cache subunit per clock cycle and loads it into the systolic processing array. Because multiple first loading units obtain input feature data from the storage unit in parallel, the amount of data obtained per clock cycle is multiplied compared with obtaining data through a single loading unit; the amount of valid data obtained by the systolic processing array is likewise multiplied, reducing the idling of processing units in the array and improving its processing efficiency. The two cases described above are now analyzed in turn to illustrate the technical effects of the embodiments of the present disclosure:
For Case 1, assume that a row of input feature data has 40 columns, each input feature datum in the row is 1 byte, and the minimum access unit is 32 bytes, i.e., the byte count of a row of input feature data exceeds the byte count of the SRAM's minimum access unit. As shown in FIG. 7A, initially no valid data is buffered in either loading unit. In the 1st clock cycle, each first loading unit can buffer 32 new bytes of valid data; therefore, in step S702 the two first loading units buffer a total of 64 bytes of valid data in the 1st clock cycle. In the 2nd clock cycle, each first loading unit can buffer 8 new bytes of valid data, so the two first loading units add a total of 16 bytes; in step S706 the two first loading units hold a total of 48 bytes of valid data in the 2nd clock cycle. Assume that the height H of the systolic processing array equals the quantity corresponding to the minimum access unit (i.e., H = 32), and that H input feature data are consumed in each of the 1st and 2nd clock cycles. Then in step S702, corresponding to the 1st clock cycle, 32 bytes of input feature data are taken from the buffers and loaded into the systolic processing array, and in step S704 the two first loading units retain the remaining 32 bytes of input feature data; in step S706, corresponding to the 2nd clock cycle, the two first loading units take another 32 bytes of input feature data from the buffers and load them into the array, and in step S708 they retain the remaining 16 bytes. Therefore, the solution of the embodiments of the present disclosure can satisfy the data loading requirements of the systolic processing array.
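The bookkeeping for Case 1 can be reproduced with a short simulation (illustrative only; function and parameter names are not from the patent): two loading units each fetch one 40-byte row in 32-byte accesses while the array consumes H = 32 bytes per cycle, and the buffered data never runs dry.

```python
# Illustrative simulation of Case 1: per-cycle (bytes loaded, bytes left
# buffered) for two loading units fetching 40-byte rows with a 32-byte
# minimum access unit, against a demand of H = 32 bytes per cycle.
def simulate(cycles, units=2, row_bytes=40, access_unit=32, demand=32):
    buffered = 0
    remaining = [row_bytes] * units        # unread bytes of each unit's row
    levels = []
    for _ in range(cycles):
        for i in range(units):             # each unit performs one access
            got = min(remaining[i], access_unit)   # valid bytes this access
            buffered += got
            remaining[i] -= got
        loaded = min(buffered, demand)     # array drains up to H bytes
        buffered -= loaded
        levels.append((loaded, buffered))
    return levels

# Cycle 1: 64 cached, 32 loaded, 32 left; cycle 2: +16 cached, 32 loaded, 16 left.
assert simulate(2) == [(32, 32), (32, 16)]
```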
In the second case, the number of columns of the input feature data is relatively small. As shown in FIG. 7B, assume that the input feature data has 32 columns, each input feature datum occupies 1 byte, the minimum access unit is 64 bytes, and there are two first loading units. Since only 32 of the 64 bytes returned by each access are valid, each first loading unit can newly cache 32 bytes of valid data per clock cycle. Initially, no valid data is loaded in either loading unit. In the first clock cycle, each first loading unit newly caches 32 bytes of valid data; therefore, in step S712, the two first loading units together cache 64 bytes of valid data. In step S716, the two loading units together cache another 64 bytes of valid data in the second clock cycle. Assume that the height H of the systolic processing array equals the number of data elements in the minimum access unit (that is, H = 64), and that H input feature data are consumed in each of the first and second clock cycles. In step S712, corresponding to the first clock cycle, 64 bytes of input feature data are taken out of the cache and loaded into the systolic processing array, so in step S714 the two first loading units hold no cached data. In step S716, corresponding to the second clock cycle, the two first loading units again take 64 bytes of input feature data out of the cache and load them into the systolic processing array, so in step S718 they again hold no cached data. It can be seen that the caching rate of the two first loading units equals the data consumption rate of the systolic processing array; therefore, the systolic processing array can always obtain enough input feature data.
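The rate matching in the two cases above can be illustrated with a short simulation sketch. This is illustrative only (the function name and cycle model are assumptions, not part of the disclosed hardware): it counts the cycles in which the cached valid data cannot cover the array's demand.

```python
def simulate_loading(num_loaders, valid_bytes_per_cycle, demand_per_cycle, cycles):
    """Count cycles in which the cached valid data falls short of the
    systolic array's per-cycle demand.

    Each of `num_loaders` first loading units adds `valid_bytes_per_cycle`
    bytes of valid data to the shared backlog every cycle; the array tries
    to consume `demand_per_cycle` bytes per cycle.
    """
    backlog = 0
    shortfalls = 0
    for _ in range(cycles):
        backlog += num_loaders * valid_bytes_per_cycle
        if backlog >= demand_per_cycle:
            backlog -= demand_per_cycle
        else:
            shortfalls += 1  # array starves this cycle
            backlog = 0
    return shortfalls

# Case two: 32 valid bytes per loader per cycle, array height H = 64.
# Two loaders exactly match the consumption rate; one loader always starves.
```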
It can be seen that, because multiple first loading units 602 are used, the amount of input feature data cached per clock cycle is multiplied, so the cached data can always supply enough data for loading, which improves the processing efficiency of the systolic processing array. Especially when the systolic processing array is tall, data can be fed to the array efficiently in this way, so that the amount of data loaded per clock cycle is as close as possible to the height of the array. The height dimension of the systolic processing array can therefore be scaled without restriction, which facilitates the flexible design of high-performance systolic processing arrays of different sizes.
Because the number of columns of an input feature map is generally large while the caching capacity of a first loading unit is limited, in practical applications the input feature map may be partitioned into multiple data blocks. The number of columns of each data block is smaller than the total number of columns of the input feature data, and the number of rows of each data block is less than or equal to the total number of rows of the input feature data. In this case, only one data block is cached and loaded at a time, and the next data block is cached after the current data block has been fully loaded. Optionally, the number of columns of a data block may equal the number of data elements in the minimum access unit of the storage unit. For example, if the minimum access unit of the storage unit is 32 bytes and each datum occupies 1 byte, a data block has 32 columns. Optionally, the number of columns of a data block may also equal an integer multiple of the number of data elements in the minimum access unit of the storage unit.
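The column-wise blocking described above can be sketched as follows (an illustrative helper, with an assumed name; the last block is allowed to be narrower when the width is not an exact multiple of the block size):

```python
def split_into_blocks(total_cols, block_cols):
    """Column ranges [start, end) of the data blocks a feature map is
    split into. `block_cols` is assumed to equal the number of data
    elements in the storage unit's minimum access unit (or an integer
    multiple of it)."""
    return [(c, min(c + block_cols, total_cols))
            for c in range(0, total_cols, block_cols)]

# A 100-column feature map with a 32-element access unit yields
# three full blocks and one narrower tail block.
blocks = split_into_blocks(100, 32)
```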
In some embodiments, each of the plurality of first loading units may further perform first padding on the input feature data while loading it. Since every first loading unit can perform padding in the same way, the padding performed by one of them (referred to as loading unit A) is described below as an example; the other first loading units operate likewise. Loading unit A may first obtain description information of the input feature data. The description information may include information indicating whether the input feature data needs padding; for example, "0" or a null value indicates that no padding is required, and "1" indicates that padding is required. When padding is required, the description information may further include padding parameters, for example, the value of the padding data and the number of rows and/or columns of padding data.
Loading unit A may determine from the description information whether the input feature data needs padding. If so, during loading it may first determine whether the datum currently being loaded is input feature data or padding data: input feature data is read directly from the cache and loaded, whereas padding data is generated from the padding parameters and then loaded. In practical applications, the first padding may include padding at least one of the left boundary and the right boundary of the input feature data. Padding the left boundary means adding at least one column of padding data before the first column of the input feature data; padding the right boundary means adding at least one column of padding data after the last column of the input feature data.
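The on-the-fly left/right padding can be pictured with a minimal sketch (the function name and element representation are assumptions; in hardware the padding values are generated rather than stored):

```python
def pad_row(row, pad_l, pad_r, value=0):
    """Emit one row of input feature data with left- and right-boundary
    padding. Real data comes from the cache (`row`); padding elements
    are generated from the padding parameters (`value`)."""
    return [value] * pad_l + list(row) + [value] * pad_r

# One cached row of three elements, padded with 1 column on the left
# and 2 columns on the right before being sent to the array.
padded = pad_row([1, 2, 3], pad_l=1, pad_r=2)
```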
In some embodiments, the first loading unit includes: a sending subunit configured to send a read instruction to the storage unit; a cache subunit configured to cache the input feature data returned by the storage unit in response to the read instruction; and a loading subunit configured to load the cached input feature data into at least one row of processing units in the systolic processing array.
The sending subunit may include a first parsing subunit configured to receive a loading instruction and parse it to generate description information of the input feature data to be loaded, and a second parsing subunit configured to parse the description information and send the read instruction to the storage unit according to the parsing result. In one embodiment, the first parsing subunit may receive the loading instruction from a controller. However, the present invention is not limited thereto; according to other embodiments of the present invention, the first parsing subunit may also receive the loading instruction from another apparatus or storage device. The description information required by one first loading unit may specifically include some or all of the following:
The number of groups of input feature data to be loaded (ifm_num), which indicates how many groups of input feature data the first loading unit is responsible for loading in total. When the input feature data is an input feature map, one input feature map is one group of input feature data.
The number of the first group of input feature data to be loaded (ifm_start), which indicates the identification of the first group of input feature data the first loading unit is responsible for loading. The numbers of the remaining groups it loads increase sequentially; for example, if ifm_start equals 1, the second group is numbered 2, the third group is numbered 3, and so on.
The number of groups of input feature data (ifm_num_perb) that can be processed simultaneously by the array block (generally comprising multiple rows of processing units) the first loading unit is responsible for loading. During convolution, the convolution results of multiple groups of input feature data and their corresponding convolution kernel parameters sometimes need to be accumulated, so the systolic processing array must be able to process multiple groups of input feature data at the same time. For example, an input feature map may include an R (red) channel map, a G (green) channel map, and a B (blue) channel map; three sets of convolution kernel parameters convolve the three channel maps respectively, and the three convolution results are accumulated. An array block of the systolic processing array can therefore be loaded with the three sets of convolution kernel parameters, and the first loading unit loads the corresponding three channel maps for those blocks. In this case, ifm_num_perb equals 3.
The number of groups of input feature data processed simultaneously by the entire systolic processing array (ifm_num_perb_total), that is, the sum of the ifm_num_perb values of all first loading units.
The height of the convolution kernel (kernel_h), which indicates the height of the convolution kernel parameters in the array block the first loading unit is responsible for loading. It should be noted that the convolution kernels in the array block loaded by one first loading unit may include all rows of a complete set of convolution kernel parameters, or only some rows of a complete kernel. For example, a set of convolution kernel parameters may be 3 × 3 while the array block a first loading unit loads has a height of 5; those 5 rows may then include three rows of the first set of kernel parameters and two rows of the second set. In some embodiments, the depth of the cache subunit included in the first loading unit may equal the depth of the array block the first loading unit is responsible for loading.
The starting position of the array block the first loading unit is responsible for loading (kernel_h_start), which indicates which row of the systolic processing array contains the first row of processing units the first loading unit loads.
The base address of the input feature data to be loaded (ifm_baddr), which indicates the first address, in the storage unit, of the input feature data the first loading unit is responsible for loading.
The storage size occupied by each group of input feature data to be loaded (ifm_len), used for addressing each group. For example, the address of the second group of input feature data is the sum of the base address and the length corresponding to ifm_len.
The width of the input feature data to be loaded (ifm_w), which indicates the total number of columns of one group of input feature data. When the width exceeds the minimum access unit of the storage unit, the same row of input feature data can be loaded in multiple accesses.
Depending on actual needs, the description information may include at least one of the above items and may also include other items, which are not detailed here. The description information lets the first loading unit know how to obtain the input feature data from the storage unit.
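The descriptor fields above, and the ifm_len addressing rule, can be collected into a small sketch. The field mnemonics come from the text; the dataclass packaging and the `group_addr` helper are illustrative assumptions, not the disclosed instruction format:

```python
from dataclasses import dataclass

@dataclass
class LoadDescriptor:
    """Hypothetical container for the description info one first
    loading unit parses from a loading instruction."""
    ifm_num: int         # number of input-feature groups to load
    ifm_start: int       # number of the first group
    ifm_num_perb: int    # groups one array block processes at once
    kernel_h: int        # convolution-kernel height
    kernel_h_start: int  # first array row this unit loads
    ifm_baddr: int       # base address in the storage unit
    ifm_len: int         # storage size of one group (for addressing)
    ifm_w: int           # columns per group

    def group_addr(self, n: int) -> int:
        # n-th group (0-based) sits ifm_len bytes after the previous one
        return self.ifm_baddr + n * self.ifm_len

d = LoadDescriptor(ifm_num=3, ifm_start=1, ifm_num_perb=3, kernel_h=3,
                   kernel_h_start=0, ifm_baddr=0x1000, ifm_len=0x400, ifm_w=32)
```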
In some embodiments, the second parsing subunit is further configured to generate auxiliary loading information, which is used to determine how the input feature data is loaded. The auxiliary loading information may include some or all of the following:
The sliding stride of the convolution kernel parameters in the column direction of the input feature data the first loading unit loads (stride_h), which indicates how many rows the convolution kernel slides down over the input feature data each time during convolution: when stride_h equals 1 it slides down 1 row, when stride_h equals 2 it slides down 2 rows, and so on. This parameter lets the first loading unit know which rows of the input feature data need to be loaded into the systolic processing array.
The height parameter of the dilated convolution kernel (dilate_h), which indicates the row spacing of the input feature data convolved with the convolution kernel parameters. For example, with a 3 × 3 kernel, if dilate_h is 1 the spacing is 1, that is, consecutive rows of the input feature data (for example, rows 1, 2, and 3) are convolved with the 3 rows of kernel parameters; if dilate_h is 2 the spacing is 2, that is, alternate rows of the input feature data (for example, rows 1, 3, and 5) are convolved with the 3 rows of kernel parameters.
The height of the input feature data (ifm_h), which indicates the total number of rows of one group of input feature data.
The height of the output feature data (ofm_h), which indicates the total number of rows of one group of output feature data.
The number of rows of padding data added to the upper boundary of the input feature data (pad_t).
The number of rows of padding data added to the lower boundary of the input feature data (pad_b).
The number of columns of padding data added to the left boundary of the input feature data (pad_l).
The number of columns of padding data added to the right boundary of the input feature data (pad_r).
The above pad_t, pad_b, pad_l, and pad_r are all referred to as padding parameters. Depending on actual needs, the auxiliary loading information may include at least one of the above items and may also include other items, which are not detailed here. In some embodiments, the cached input feature data may further be rearranged according to the auxiliary loading information so that the way the data is cached matches the way it is loaded, thereby improving loading efficiency. For example, through rearrangement, the input feature data destined for different rows of processing units can be cached at different cache addresses of the cache subunit, or the data required for loading can be filtered out of the cached input feature data as needed.
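How stride_h, dilate_h, and pad_t together select input rows can be sketched as follows (an illustrative helper with an assumed name; indices are 0-based, and negative indices denote top-padding rows):

```python
def input_rows_for_output(out_row, kernel_h, stride_h, dilate_h, pad_t=0):
    """Indices of the input-feature rows combined with the kernel to
    produce one output row: the kernel's top edge sits at
    out_row * stride_h (shifted up by pad_t), and successive kernel
    rows hit input rows dilate_h apart (dilate_h == 1 means adjacent)."""
    first = out_row * stride_h - pad_t
    return [first + k * dilate_h for k in range(kernel_h)]

# 3-row kernel, stride 1: output row 0 uses input rows 0, 1, 2;
# with dilate_h = 2 it instead uses alternate rows 0, 2, 4.
```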
In some embodiments, the cache subunit includes: a first cache subunit configured to cache the auxiliary loading information; and a second cache subunit configured to read the auxiliary loading information from the first cache subunit and cache, according to it, the input feature data returned in response to the read instruction. The first cache subunit may be a first-in-first-out (FIFO) queue.
In some embodiments, the second cache subunit includes: a third cache subunit configured to cache the input feature data returned by the storage unit in response to the read instruction; a read-write subunit configured to read the auxiliary loading information from the first cache subunit, rearrange the input feature data cached in the third cache subunit according to the auxiliary loading information, and write the rearranged input feature data into a fourth cache subunit; and the fourth cache subunit, configured to cache the rearranged input feature data so that the loading subunit can load it into the systolic processing array. The third cache subunit may be a FIFO queue, and the fourth cache subunit may be a random access memory (RAM).
In some embodiments, the read-write subunit is further configured to perform second padding on the input feature data while rearranging the input feature data cached in the third cache subunit. The second padding includes padding at least one of the upper boundary and the lower boundary of the input feature data.
FIG. 8 is a schematic structural diagram of the first loading unit according to an embodiment of the present disclosure. The first parsing subunit 802 (denoted IFM_SBLK_INFO in the figure) receives an instruction, parses it according to the systolic processing array's requirements for input feature data, and generates the description information for the second parsing subunit 804 (denoted IFM_SBLK_RD in the figure).
The second parsing subunit 804 parses the description information and sends an instruction to the storage unit to read the input feature data; the data read back is written into the third cache subunit 806 (denoted IFM_FIFO in the figure). During parsing, it also generates auxiliary loading information for the first cache subunit 814 (denoted INFO_FIFO in the figure).
The read-write subunit 808 (RD_FIFO_WR_RAM) reads data from the third cache subunit 806 according to the auxiliary loading information generated by the second parsing subunit 804 and loads it into the fourth cache subunit 812 (denoted IFM_RAM in the figure). Upper-boundary and lower-boundary padding are completed during this process.
The loading subunit 810 (denoted WR_IFM_2MAC in the figure) reads data from the fourth cache subunit 812 according to the systolic processing array's requirements for input feature data and sends it to the systolic processing array in turn. Left-boundary and right-boundary padding are completed during this process.
The third cache subunit 806 stores the input feature data read out of the storage unit. That is, the third cache subunit 806 caches the input feature data sent by the storage unit, and the data cached in it is then loaded into the fourth cache subunit 812. The width of the third cache subunit 806 matches the bit width of the storage unit's data port, that is, the storage unit's minimum access unit.
In some embodiments, there are multiple fourth cache subunits, which cache the rearranged input feature data in turn, and the loading subunit loads the input feature data from each of the fourth cache subunits into the systolic processing array in turn. When there are multiple cache subunits, each may have its own fourth cache subunit while the remaining functional units are shared. Each of the fourth cache subunits begins caching the rearranged input feature data once the previous fourth cache subunit has finished caching.
Further, each of the fourth cache subunits may include multiple fifth cache subunits. The height of each fifth cache subunit equals the height of the fourth cache subunit, and the width of each fifth cache subunit equals the bit width of the input feature data read from the storage unit.
In some embodiments, as shown in FIG. 9, there are three fourth cache subunits (the IFM_RAMs in the figure), referred to as cache subunit 902 (denoted ping_ram), cache subunit 904 (denoted pong_ram), and cache subunit 906 (denoted peng_ram). The fourth cache subunits store the input feature data according to the systolic processing array's requirements. The three identical ping-pong-peng IFM_RAM groups can work in a pipeline to speed up loading of the systolic processing array. The depth of each fourth cache subunit group equals the height h of one array block of the systolic processing array, and each group contains several small RAMs of depth h (for example, ping_ram0) whose width matches the bit width of the input feature data. The present disclosure uses 8 bits for illustration, but other bit widths are also applicable. Three groups of IFM_RAM need not be instantiated for every application scenario. For example, when performance requirements are low, a single group of IFM_RAM may be instantiated; when performance requirements are high, more groups may be instantiated, up to the point where adding IFM_RAM no longer improves performance.
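The benefit of pipelining multiple IFM_RAM groups can be estimated with a simple cycle model. This is a back-of-the-envelope sketch under stated assumptions (fill time does not exceed drain time, and the function name is illustrative), not a timing claim about the disclosed hardware:

```python
def total_cycles(num_blocks, fill_cycles, drain_cycles, pipelined):
    """Cycles to stream `num_blocks` data blocks through the array.

    Without pipelining, each block is fully buffered into an IFM_RAM and
    then fully drained before the next begins. With ping-pong(-peng)
    buffering, the next block is buffered while the current one drains,
    so only the first fill is exposed (assuming fill_cycles <= drain_cycles).
    """
    if not pipelined:
        return num_blocks * (fill_cycles + drain_cycles)
    return fill_cycles + num_blocks * drain_cycles
```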
The first loading units can be synchronized by a synchronization signal out_sync, which indicates whether the corresponding cache subunit 902, cache subunit 904, or cache subunit 906 in each first loading unit has finished loading data and is ready to output to the systolic processing array. Only when cache subunit 902, cache subunit 904, or cache subunit 906 in every first loading unit has been loaded does the selection unit 910 select the corresponding data from among them and output this round of data to the systolic processing array. In one embodiment, each of the plurality of first loading units can read multiple data blocks of the input feature data from the storage unit in turn, and once at least one of those data blocks has been fully cached, it can load that data block into an array block of the systolic processing array.
FIG. 10 is a processing flowchart of the data processing apparatus according to an embodiment of the present disclosure. It should be noted that this flow is only a schematic illustration of one possible processing manner of the data processing apparatus of the present disclosure and is not intended to limit the present disclosure.
Step 1002: Receive an instruction, which may carry at least one item of the description information above.
Step 1004: Parse the instruction to obtain the description information.
Step 1006: Read row data in units of the storage unit's minimum access unit. That is, starting from the row of the first input feature map ifm_start that corresponds to the kernel-height starting position kernel_h_start, access one minimum access unit per row until the kernel_h rows corresponding to one convolution kernel have been accessed, then continue with the kernel_h rows of the next group of input feature data corresponding to the next convolution kernel, again one minimum access unit per row. Here kernel_h is the convolution kernel height.
Step 1008: Following step 1006, complete the row accesses for the ifm_num_perb input feature data that one array block maps at a time, each row access corresponding to one minimum access unit. In one embodiment, ifm_num_perb is the number of input feature maps corresponding to one mapping of the array block.
Step 1010: Repeat steps 1006 and 1008, accessing one minimum access unit per row, until the accesses cover the entire row. This completes one loading of input feature data for the kernel_h rows of one input feature map, with kernel sliding stride stride_h and dilated-kernel height dilate_h. The convolution kernel parameters must be moved over the input feature data a total of ofm_h times, where ofm_h is the output feature map height.
Step 1012: Repeat steps 1006 to 1010 to complete the ofm_h moving scans of the convolution kernel parameters over the input feature data. At this point, the loading of all input feature data corresponding to one ifm_num_perb is complete.
Step 1014: Repeat steps 1006 to 1012, in units of ifm_num_perb (the number of input feature maps one array block maps at a time), until all input feature maps ifm have been loaded.
Step 1016: End.
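The loop nesting implied by steps 1006 to 1014 can be sketched as follows (names and the access-count model are illustrative assumptions; each innermost iteration stands for one minimum-access-unit read of one row):

```python
def count_row_accesses(ifm_num, ifm_num_perb, kernel_h, ofm_h, accesses_per_row):
    """Number of minimum-access-unit reads issued by the flow of
    FIG. 10, assuming ifm_num is a multiple of ifm_num_perb and each
    row needs `accesses_per_row` reads to be covered."""
    count = 0
    for _ in range(0, ifm_num, ifm_num_perb):      # step 1014: all maps, perb at a time
        for _ in range(ofm_h):                     # step 1012: ofm_h kernel positions
            for _ in range(accesses_per_row):      # step 1010: cover the whole row
                for _ in range(ifm_num_perb):      # step 1008: maps mapped at once
                    for _ in range(kernel_h):      # step 1006: kernel_h rows
                        count += 1
    return count
```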
In some embodiments, the number of columns of the systolic processing array is an integer multiple of 3. Because the minimum access unit of storage units such as SRAM is a power-of-two number of bytes, the dimensions of most systolic processing arrays in the industry are also powers of two. In current convolutional neural networks, however, 3 × 3 kernels account for a large proportion of kernels of all sizes, and when a 3 × 3 kernel is mapped onto a power-of-two-dimension systolic processing array, some processing units cannot be mapped to, wasting systolic processing array resources. Designing the number of columns of the systolic processing array as an integer multiple of 3 effectively reduces this waste in most cases, thereby improving the processing efficiency of the systolic processing array.
举例来说,在2的n次方维度的脉动处理阵列中,当n=5时,脉动处理阵列的尺寸就是32×32(即脉动处理阵列的每行和每列均有32个处理单元),这也是业界常见的1K规模CNN加速器的脉动处理阵列大小,但对于3×3的卷积核,这样的脉动处理阵列,总有两行和两列无法用到,造成了宝贵的脉动处理阵列计算资源的浪费。如图11A所示,每个小方块表示一个处理单元1106,多个第一装载单元1102分别装载输入特征图IFM0至输入特征图IFM9,多个第二装载单元1104分别装载权重Weight 0至权重Weight 9。其中,多个第一装载单元1102在不同时刻将各自对应的输入特征图传送至对应的多个处理单元106。所述对应的多个处理单元位于一个阵列块中。也就是说,多个第一装载单元1102在不同时刻将各自对应的输入特征图传送至对应阵列块。例如,在t0时刻,装载有输入特征图IFM0的第一装载单元1102将输入特征图IFM0传送至与该第一装载单元1102相邻的n行处理单元1106,其中,n为行数。优选地,n与卷积核的尺寸相对应。在一个实施方式中,当卷积核的尺寸为3×3的话,n等于3。也就是说,n等于卷积核的宽度或者长度。需要说明的是,位置上相邻的两个第一装载单元1102将各自对应的特征图传送至对应的处理单元1106的时刻间隔一个时钟周期。另外,多个第二装载单元1104在不同时刻将各自对应的权重传送至处理 单元1106。例如,在t0时刻,装载有权重Weight 0的第二装载单元1104将权重Weight 0传送至与该第二装载单元1104相邻的n行处理单元1106,其中,n为行数。优选地,n与卷积核的尺寸相对应。在一个实施方式中,当卷积核的尺寸为3×3的话,n等于3。也就是说,n等于卷积核的宽度或者长度。需要说明的是,位置上相邻的两个第一装载单元1104将各自对应的特征图传送至对应的处理单元1106的时刻间隔一个时钟周期。如果卷积核的尺寸为3×3的话(即,kij(1≤i≤3,1≤j≤3)),上述脉动处理阵列中的每三行处理单元1106最多能装载10组卷积核参数。然而,该脉动处理阵列的最后两行和最后两列则无法用于装载上述3×3的卷积核,因此这部分处理单元1106空闲,用“0”所示,并且对应的第一装载单元和第二装载单元的状态也为空闲。因此,理想情况下,脉动处理阵列的利用率是30/32=93.75%。For example, in the systolic processing array of 2 to the nth power, when n=5, the size of the systolic processing array is 32×32 (that is, each row and each column of the systolic processing array has 32 processing units) , which is also the size of the pulsation processing array of the common 1K scale CNN accelerator in the industry, but for a 3×3 convolution kernel, there are always two rows and two columns of such a pulsation processing array that cannot be used, resulting in a valuable pulsation processing array. Waste of computing resources. As shown in FIG. 11A , each small square represents a processing unit 1106, a plurality of first loading units 1102 respectively load input feature maps IFM0 to input feature maps IFM9, and a plurality of second loading units 1104 respectively load weights Weight 0 to weight Weight 9. 
The plurality of first loading units 1102 transmit their respective input feature maps to the corresponding plurality of processing units 1106 at different times. The corresponding plurality of processing units are located in one array block. That is, the plurality of first loading units 1102 transmit their respective input feature maps to corresponding array blocks at different times. For example, at time t0, the first loading unit 1102 loaded with input feature map IFM0 transmits IFM0 to the n rows of processing units 1106 adjacent to that first loading unit 1102, where n is a number of rows. Preferably, n corresponds to the size of the convolution kernel. In one embodiment, when the size of the convolution kernel is 3×3, n is equal to 3. That is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two positionally adjacent first loading units 1102 transmit their respective feature maps to the corresponding processing units 1106 are separated by one clock cycle. In addition, the plurality of second loading units 1104 transmit their respective weights to the processing units 1106 at different times. For example, at time t0, the second loading unit 1104 loaded with weight Weight 0 transmits Weight 0 to the n rows of processing units 1106 adjacent to that second loading unit 1104, where n is a number of rows. Preferably, n corresponds to the size of the convolution kernel. In one embodiment, when the size of the convolution kernel is 3×3, n is equal to 3. That is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two positionally adjacent second loading units 1104 transmit their respective weights to the corresponding processing units 1106 are separated by one clock cycle.
If the size of the convolution kernel is 3×3 (that is, kij (1≤i≤3, 1≤j≤3)), every three rows of processing units 1106 in the above systolic processing array can load at most 10 groups of convolution kernel parameters. However, the last two rows and the last two columns of the systolic processing array cannot be used to load the above 3×3 convolution kernels, so these processing units 1106 are idle, indicated by "0", and the corresponding first loading units and second loading units are also idle. Therefore, ideally, the utilization of the systolic processing array is 30/32 = 93.75%.
如果按照3的整数倍的维度来设计脉动处理阵列的尺寸，可以将上述例子中的脉动处理阵列尺寸由32×32优化为33×33，这样3×3的卷积核就能把33×33个处理单元全部利用。请参见图11B，图11B是根据本公开实施例的脉动处理阵列的长度的示意图。需要说明的是，图11B中与图11A相同的元件具有与图11A中元件相同或相似的功能。如图11B所示，仍以图中的每个小方块表示一个处理单元1106，则上述脉动处理阵列中的每组3×3的处理单元1106均可以用于装载一个3×3的卷积核，每个处理单元1106中均可装载一个卷积核参数kij，不存在空闲的处理单元1106。因此，理想情况下，脉动处理阵列的利用率是100%。If the size of the systolic processing array is designed with dimensions that are integer multiples of 3, the array size in the above example can be optimized from 32×32 to 33×33, so that 3×3 convolution kernels can utilize all 33×33 processing units. Please refer to FIG. 11B, which is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure. It should be noted that elements in FIG. 11B that are the same as those in FIG. 11A have the same or similar functions as the elements in FIG. 11A. As shown in FIG. 11B, with each small square in the figure again representing a processing unit 1106, each 3×3 group of processing units 1106 in the above systolic processing array can be used to load one 3×3 convolution kernel, each processing unit 1106 can load one convolution kernel parameter kij, and no processing unit 1106 is idle. Therefore, ideally, the utilization of the systolic processing array is 100%.
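The utilization figures above (30/32 = 93.75% for a 32×32 array and 100% for a 33×33 array with 3×3 kernels) can be checked with a short sketch. This is illustrative only and not part of the claimed apparatus; the function name and the per-dimension accounting used in the text are assumptions.

```python
def dimension_utilization(num_cols: int, k: int = 3) -> float:
    """Fraction of columns (or rows) usable when tiling k-wide
    convolution kernels onto a systolic array with num_cols columns,
    using the per-dimension accounting from the text above."""
    usable = (num_cols // k) * k  # columns covered by whole kernels
    return usable / num_cols

# 32x32 array, 3x3 kernels: the last two columns are unusable.
assert dimension_utilization(32) == 30 / 32   # 93.75%
# 33x33 array (columns a multiple of 3): no column is wasted.
assert dimension_utilization(33) == 1.0       # 100%
```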
上述数据处理装置可以是处理芯片，例如，SoC芯片，也可以是计算机设备。图12示出了本说明书实施例所提供的一种更为具体的数据处理装置的硬件结构示意图，该设备可以包括：处理器1202、存储器1204、输入/输出接口1206、通信接口1208和总线1210。其中处理器1202、存储器1204、输入/输出接口1206和通信接口1208通过总线1210实现彼此之间在设备内部的通信连接。The above data processing apparatus may be a processing chip, for example, an SoC chip, or may be a computer device. FIG. 12 shows a schematic diagram of the hardware structure of a more specific data processing apparatus provided by an embodiment of this specification. The apparatus may include: a processor 1202, a memory 1204, an input/output interface 1206, a communication interface 1208, and a bus 1210. The processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208 are communicatively connected to one another within the device through the bus 1210.
处理器1202可以采用通用的CPU（Central Processing Unit，中央处理器）、微处理器、应用专用集成电路（Application Specific Integrated Circuit，ASIC）、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本说明书实施例所提供的技术方案。The processor 1202 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of this specification.
存储器1204可以采用ROM（Read Only Memory，只读存储器）、RAM（Random Access Memory，随机存取存储器）、静态存储设备，动态存储设备等形式实现。存储器1204可以存储操作系统和其他应用程序，在通过软件或者固件来实现本说明书实施例所提供的技术方案时，相关的程序代码保存在存储器1204中，并由处理器1202来调用执行。The memory 1204 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, and the like. The memory 1204 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1204 and invoked and executed by the processor 1202.
输入/输出接口1206用于连接输入/输出模块，以实现信息输入及输出。输入/输出模块可以作为组件配置在设备中（图中未示出），也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等，输出设备可以包括显示器、扬声器、振动器、指示灯等。The input/output interface 1206 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
通信接口1208用于连接通信模块(图中未示出),以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。The communication interface 1208 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices. The communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
总线1210包括一通路,在设备的各个组件(例如处理器1202、存储器1204、输入/输出接口1206和通信接口1208)之间传输信息。The bus 1210 includes a path that transfers information between the various components of the device (eg, the processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208).
需要说明的是，尽管上述设备仅示出了处理器1202、存储器1204、输入/输出接口1206、通信接口1208以及总线1210，但是在具体实施过程中，该设备还可包括实现正常运行所必需的其他组件。此外，本领域的技术人员可以理解，上述设备中也可以仅包含实现本说明书实施例方案所必需的组件，不必包含图中所示的全部组件。It should be noted that, although the above device shows only the processor 1202, the memory 1204, the input/output interface 1206, the communication interface 1208, and the bus 1210, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figures.
如图14所示，本公开实施例还提供一种数据处理系统1400，可包括上述任一实施例的数据处理装置（例如，数据处理装置1406），以及脉动处理阵列1408，用于装载所述输入特征数据和卷积核参数，并对所述输入特征数据和所述卷积核参数进行卷积处理。As shown in FIG. 14, an embodiment of the present disclosure further provides a data processing system 1400, which may include the data processing apparatus of any of the above embodiments (e.g., a data processing apparatus 1406) and a systolic processing array 1408 configured to load the input feature data and the convolution kernel parameters, and to perform convolution processing on the input feature data and the convolution kernel parameters.
在一些实施例中,所述系统还包括:第二装载单元1404,用于将所述卷积核参数装载到所述脉动处理阵列中。In some embodiments, the system further includes: a second loading unit 1404 for loading the convolution kernel parameters into the systolic processing array.
在一些实施例中,所述系统还包括:存储单元,用于存储所述输入特征数据。In some embodiments, the system further includes: a storage unit for storing the input feature data.
在一些实施例中，如图13所示，所述存储单元1302包括：多个相互独立的存储子单元，每个存储子单元用于存储所述输入特征数据中的部分数据；所述多个第一装载单元用于在同一时刻访问不同的存储子单元，以获取对应存储子单元存储的输入特征数据。In some embodiments, as shown in FIG. 13, the storage unit 1302 includes a plurality of mutually independent storage subunits, each storage subunit being used to store part of the input feature data; the plurality of first loading units are configured to access different storage subunits at the same time to acquire the input feature data stored in the corresponding storage subunits.
进一步地，所述存储单元1302还包括调度单元1304，用于接收所述多个第一装载单元的访问请求，并将所述访问请求发送至对应的存储子单元1306，以使所述多个第一装载单元访问对应的存储子单元1306。Further, the storage unit 1302 further includes a scheduling unit 1304 configured to receive access requests from the plurality of first loading units and to send the access requests to the corresponding storage subunits 1306, so that the plurality of first loading units access the corresponding storage subunits 1306.
在一些实施例中，所述系统还包括：输出单元，用于获取所述脉动处理阵列输出的处理结果，并对所述处理结果进行存储，或者输出所述处理结果。In some embodiments, the system further includes: an output unit configured to acquire the processing result output by the systolic processing array, and to store or output the processing result.
在一些实施例中，所述系统可以基于现场可编程逻辑门阵列（Field Programmable Gate Array，FPGA）或者专用集成电路（Application Specific Integrated Circuit，ASIC）实现。In some embodiments, the system may be implemented based on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
如图15A和图15B所示，本公开实施例还提供一种神经网络加速器1500，其特征在于，所述神经网络加速器包括本公开任一实施例所述的数据处理装置1502，或者包括本公开任一实施例所述的数据处理系统1504。As shown in FIG. 15A and FIG. 15B, an embodiment of the present disclosure further provides a neural network accelerator 1500, where the neural network accelerator includes the data processing apparatus 1502 described in any embodiment of the present disclosure, or includes the data processing system 1504 described in any embodiment of the present disclosure.
在一些实施例中,所述神经网络加速器为CNN加速器或者循环神经网络(Recurrent Neural Network,RNN)加速器。In some embodiments, the neural network accelerator is a CNN accelerator or a Recurrent Neural Network (RNN) accelerator.
如图16所示，本公开还提供一种数据处理方法，应用于包括多个第一装载单元的数据处理装置中的每个第一装载单元，以将输入特征数据装载到脉动处理阵列中，所述方法包括：As shown in FIG. 16, the present disclosure further provides a data processing method applied to each first loading unit in a data processing apparatus including a plurality of first loading units, so as to load input feature data into a systolic processing array. The method includes:
步骤1602:通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元,从所述存储单元读取输入特征数据;Step 1602: Access a storage unit in parallel through each of the plurality of first loading units, and read input feature data from the storage unit;
步骤1604:对读取的所述输入特征数据进行缓存;以及Step 1604: Cache the read input feature data; and
步骤1606:将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元。Step 1606: Load the buffered input feature data into at least one row of processing units in the systolic processing array.
在一些实施例中,所述多个第一装载单元对所述输入特征数据中的有效数据的缓存速率之和大于或等于所述脉动处理阵列的装载速率。In some embodiments, the sum of the buffering rates of valid data in the input feature data by the plurality of first loading units is greater than or equal to the loading rate of the systolic processing array.
在一些实施例中，每个时钟周期内所述输入特征数据的缓存速率之和均大于或等于所述脉动处理阵列在该时钟周期内的装载速率；或者所述输入特征数据的平均缓存速率之和大于或等于所述脉动处理阵列的平均装载速率；或者所述输入特征数据的平均缓存速率之和大于或等于所述脉动处理阵列的最大装载速率。In some embodiments, the sum of the buffering rates of the input feature data in each clock cycle is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average buffering rates of the input feature data is greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data is greater than or equal to the maximum loading rate of the systolic processing array.
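The rate condition stated above (the combined caching rate of the first loading units must meet or exceed the array's loading rate so the systolic array is never starved) can be sketched as a simple check. Function names and units (elements per clock cycle) are illustrative assumptions, not terms from the specification.

```python
def cache_rates_sufficient(cache_rates, load_rate):
    """True if the summed caching rate of the first loading units
    meets or exceeds the systolic array's loading rate."""
    return sum(cache_rates) >= load_rate

# Four loading units each caching 8 elements/cycle can feed an array
# that consumes 32 elements/cycle; three such units cannot.
assert cache_rates_sufficient([8, 8, 8, 8], 32)
assert not cache_rates_sufficient([8, 8, 8], 32)
```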
在一些实施例中,所述方法还包括:在对所述输入特征数据进行装载的过程中,对所述输入特征数据进行第一填充。In some embodiments, the method further includes: in the process of loading the input feature data, performing a first filling on the input feature data.
在一些实施例中,所述对所述输入特征数据进行第一填充,包括:对所述输入特征数据的左边界和右边界中的至少一者进行填充。In some embodiments, the performing the first padding on the input feature data includes: padding at least one of a left border and a right border of the input feature data.
在一些实施例中，所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取输入特征数据，包括：每次从所述存储单元读取所述输入特征数据中的一个数据块；所述将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元，包括：在对所述数据块缓存完毕的情况下，将所述数据块装载到所述脉动处理阵列中的至少一行处理单元中。In some embodiments, accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: reading one data block of the input feature data from the storage unit each time; and loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading the data block into at least one row of processing units in the systolic processing array after the data block has been completely cached.
在一些实施例中,所述数据块的列数或行数等于所述存储单元的最小访问单位对应的数据个数。In some embodiments, the number of columns or rows of the data block is equal to the number of data corresponding to the minimum access unit of the storage unit.
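The relationship above between the data block's column (or row) count and the storage unit's minimum access unit can be illustrated as follows; the byte sizes are hypothetical examples, not values from the specification.

```python
def block_width_in_elements(min_access_bytes: int, bytes_per_element: int) -> int:
    """Number of feature-data elements per data block when the block's
    column count matches the storage unit's minimum access unit."""
    return min_access_bytes // bytes_per_element

assert block_width_in_elements(32, 1) == 32  # 32-byte access unit, int8 features
assert block_width_in_elements(32, 2) == 16  # 16-bit features
```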
在一些实施例中，所述多个第一装载单元中的每个第一装载单元包括发送子单元，缓存子单元和装载子单元；所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取输入特征数据，包括：通过所述发送子单元，向所述存储单元发送读取指令；通过所述缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存；以及通过所述装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中。In some embodiments, each of the plurality of first loading units includes a sending subunit, a cache subunit, and a loading subunit; and accessing the storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: sending a read instruction to the storage unit through the sending subunit; caching, through the cache subunit, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array.
在一些实施例中,所述缓存子单元的高度等于所述脉动处理阵列的高度。In some embodiments, the height of the cache subunit is equal to the height of the systolic processing array.
在一些实施例中,在所述脉动处理阵列包括多个阵列块的情况下,每个缓存子单元对应一个阵列块,且所述缓存子单元的高度等于对应阵列块的高度。In some embodiments, when the systolic processing array includes a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to the height of the corresponding array block.
在一些实施例中，所述缓存子单元的数量为多个；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：通过多个缓存子单元，依次对所述根据所述读取指令返回的输入特征数据进行缓存；所述通过装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中，包括：依次将所述多个缓存子单元中的每个缓存子单元中的输入特征数据装载到所述脉动处理阵列的所述至少一行处理单元中。In some embodiments, there are a plurality of the cache subunits; caching, through the cache subunits, the input feature data returned according to the read instruction includes: sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array includes: sequentially loading the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
在一些实施例中，所述通过所述多个缓存子单元，依次对所述根据所述读取指令返回的输入特征数据进行缓存，包括：在所述多个缓存子单元中的前一个缓存子单元对应的输入特征数据缓存完成的情况下，开始对所述多个缓存子单元中的当前缓存子单元对应的输入特征数据进行缓存。In some embodiments, sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction includes: starting to cache the input feature data corresponding to the current cache subunit of the plurality of cache subunits once the caching of the input feature data corresponding to the previous cache subunit of the plurality of cache subunits has been completed.
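The alternating use of cache subunits described above (the next subunit begins caching only after the previous one has finished, while draining into the array proceeds subunit by subunit in the same order) is essentially double buffering. A minimal sketch, with hypothetical class and method names:

```python
from collections import deque

class AlternatingCache:
    """Ping-pong style caching: items are written into one subunit
    until it is full, then the next subunit starts filling; draining
    proceeds subunit by subunit in the same order."""
    def __init__(self, num_subunits: int, depth: int):
        self.bufs = [deque(maxlen=depth) for _ in range(num_subunits)]
        self.fill = 0   # index of the subunit currently being filled
        self.drain = 0  # index of the subunit currently being drained

    def cache(self, item):
        buf = self.bufs[self.fill]
        buf.append(item)
        if len(buf) == buf.maxlen:  # current subunit finished caching
            self.fill = (self.fill + 1) % len(self.bufs)

    def load(self):
        buf = self.bufs[self.drain]
        item = buf.popleft()
        if not buf:  # subunit drained; move on to the next one
            self.drain = (self.drain + 1) % len(self.bufs)
        return item

c = AlternatingCache(num_subunits=2, depth=2)
for x in (10, 11, 12, 13):
    c.cache(x)
assert [c.load() for _ in range(4)] == [10, 11, 12, 13]
```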
在一些实施例中，所述缓存子单元包括多个缓存块；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：通过所述多个缓存块中的每个缓存块缓存所述脉动处理阵列中的一行处理单元所需的输入特征数据；所述通过所述装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中，包括：通过所述装载子单元，将所述多个缓存块中的每个缓存块缓存的输入特征数据装载到所述脉动处理阵列中对应的一行处理单元；其中，在第v个缓存块缓存的输入特征数据装载完成之后，向第v+1个缓存块发送第一装载指令，以使所述装载子单元开始对所述第v+1个缓存块缓存的输入特征数据进行装载；其中，v为大于1的整数。In some embodiments, the cache subunit includes a plurality of cache blocks; caching, through the cache subunit, the input feature data returned according to the read instruction includes: caching, in each of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array includes: loading, through the loading subunit, the input feature data cached in each of the plurality of cache blocks into a corresponding row of processing units in the systolic processing array; where, after the loading of the input feature data cached in the v-th cache block is completed, a first loading instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
在一些实施例中，所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取所述输入特征数据，包括：在所述多个第一装载单元中的每个第一装载单元对应的输入特征数据装载完成之后，向下一个第一装载单元发送第二装载指令，以触发下一个第一装载单元向所述脉动处理阵列装载所述输入特征数据。In some embodiments, accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: after the loading of the input feature data corresponding to each of the plurality of first loading units is completed, sending a second loading instruction to the next first loading unit to trigger the next first loading unit to load the input feature data into the systolic processing array.
在一些实施例中，所述发送子单元包括第一解析子单元和第二解析子单元；以及所述通过发送子单元，向所述存储单元发送读取指令，包括：通过第一解析子单元，接收装载指令，对所述装载指令进行解析，以生成待装载的输入特征数据的描述信息；以及通过第二解析子单元，对所述描述信息进行解析，根据解析结果向所述存储单元发送所述读取指令。In some embodiments, the sending subunit includes a first parsing subunit and a second parsing subunit; and sending the read instruction to the storage unit through the sending subunit includes: receiving, through the first parsing subunit, a loading instruction and parsing the loading instruction to generate description information of the input feature data to be loaded; and parsing the description information through the second parsing subunit and sending the read instruction to the storage unit according to the parsing result.
在一些实施例中，所述描述信息包括以下至少任一：待装载的输入特征数据的组数，待装载的第一组输入特征数据的编号，对应阵列块能够同时处理的输入特征数据的组数，所述脉动处理阵列同时处理的输入特征数据的组数，卷积核的高度，对应阵列块的起始位置，待装载的输入特征数据的基地址，待装载的每组输入特征数据所占用存储空间的大小，以及待装载的输入特征数据的宽度。In some embodiments, the description information includes at least any one of the following: the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data that the systolic processing array processes simultaneously, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
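The description-information fields listed above can be grouped, for illustration, into a single descriptor. All field names and the address helper below are hypothetical, and contiguous group storage is an assumption; none of this is taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class LoadDescriptor:
    """Illustrative container for the description information above."""
    num_groups: int        # number of groups of input feature data to load
    first_group_id: int    # number of the first group to load
    groups_per_block: int  # groups the corresponding array block handles at once
    groups_per_array: int  # groups the whole systolic array handles at once
    kernel_height: int     # height of the convolution kernel
    block_start: int       # starting position of the corresponding array block
    base_address: int      # base address of the input feature data to load
    group_size: int        # storage space occupied by each group
    feature_width: int     # width of the input feature data to load

    def group_address(self, i: int) -> int:
        """Address of the i-th group, assuming groups are contiguous."""
        return self.base_address + i * self.group_size

d = LoadDescriptor(10, 0, 1, 10, 3, 0, 0x1000, 0x200, 224)
assert d.group_address(2) == 0x1000 + 2 * 0x200
```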
在一些实施例中，所述缓存子单元包括第一缓存子单元和第二缓存子单元；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：生成辅助装载信息，并且通过所述辅助装载信息，确定所述待装载的输入特征数据的装载方式；通过第一缓存子单元，对所述辅助装载信息进行缓存；通过第二缓存子单元，从所述第一缓存子单元中读取所述辅助装载信息，并根据所述辅助装载信息对根据所述读取指令返回的输入特征数据进行缓存。In some embodiments, the cache subunit includes a first cache subunit and a second cache subunit; caching, through the cache subunit, the input feature data returned according to the read instruction includes: generating auxiliary loading information and determining, through the auxiliary loading information, a loading mode of the input feature data to be loaded; caching the auxiliary loading information through the first cache subunit; and reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching, according to the auxiliary loading information, the input feature data returned according to the read instruction.
在一些实施例中，所述通过第二缓存子单元，从所述第一缓存子单元中读取所述辅助装载信息，并根据所述辅助装载信息对所述输入特征数据进行缓存，进一步包括：通过第三缓存子单元，对所述存储单元根据所述读取指令返回的输入特征数据进行缓存；通过读写子单元，从所述第一缓存子单元中读取所述辅助装载信息，根据所述辅助装载信息对所述第三缓存子单元缓存的输入特征数据进行重排，并将重排后的输入特征数据写入所述第四缓存子单元；以及通过所述第四缓存子单元，对重排后的输入特征数据进行缓存，以供所述装载子单元将所述重排后的输入特征数据装载到所述脉动处理阵列中。In some embodiments, reading, through the second cache subunit, the auxiliary loading information from the first cache subunit and caching the input feature data according to the auxiliary loading information further includes: caching, through a third cache subunit, the input feature data returned by the storage unit according to the read instruction; reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging the input feature data cached in the third cache subunit according to the auxiliary loading information, and writing the rearranged input feature data into a fourth cache subunit; and caching the rearranged input feature data through the fourth cache subunit, so that the loading subunit loads the rearranged input feature data into the systolic processing array.
在一些实施例中，所述第四缓存子单元的数量为多个；所述通过所述第四缓存子单元，对重排后的输入特征数据进行缓存，包括：通过多个第四缓存子单元，依次对所述重排后的输入特征数据进行缓存；所述装载子单元将所述重排后的输入特征数据装载到所述脉动处理阵列中，包括：通过所述装载子单元，依次将所述多个第四缓存子单元中的每个第四缓存子单元中的输入特征数据装载到所述脉动处理阵列中。In some embodiments, there are a plurality of the fourth cache subunits; caching the rearranged input feature data through the fourth cache subunits includes: sequentially caching the rearranged input feature data through the plurality of fourth cache subunits; and loading, by the loading subunit, the rearranged input feature data into the systolic processing array includes: sequentially loading, through the loading subunit, the input feature data in each of the plurality of fourth cache subunits into the systolic processing array.
在一些实施例中，在所述多个第四缓存子单元中的前一个第四缓存子单元对应的重排后的输入特征数据缓存完成的情况下，开始对所述多个第四缓存子单元中的当前第四缓存子单元对应的重排后的输入特征数据进行缓存。In some embodiments, the caching of the rearranged input feature data corresponding to the current fourth cache subunit of the plurality of fourth cache subunits is started once the caching of the rearranged input feature data corresponding to the previous fourth cache subunit of the plurality of fourth cache subunits has been completed.
在一些实施例中，所述多个第四缓存子单元中的每个第四缓存子单元包括多个第五缓存子单元，所述第四缓存子单元包括的多个第五缓存子单元中每个第五缓存子单元的高度等于所述第四缓存子单元的高度，每个第五缓存子单元的宽度等于从所述存储单元中读取的输入特征数据的位宽。In some embodiments, each fourth cache subunit of the plurality of fourth cache subunits includes a plurality of fifth cache subunits; the height of each fifth cache subunit included in the fourth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
在一些实施例中，所述方法还包括：在对所述第三缓存子单元缓存的输入特征数据进行重排的过程中，对所述第三缓存子单元缓存的输入特征数据进行第二填充。In some embodiments, the method further includes: in the process of rearranging the input feature data cached in the third cache subunit, performing a second filling on the input feature data cached in the third cache subunit.
在一些实施例中，所述对所述第三缓存子单元缓存的输入特征数据进行第二填充，包括：对所述第三缓存子单元缓存的输入特征数据的上边界和下边界中的至少一者进行填充。In some embodiments, performing the second filling on the input feature data cached in the third cache subunit includes: padding at least one of the upper boundary and the lower boundary of the input feature data cached in the third cache subunit.
在一些实施例中，所述辅助装载信息包括以下至少任一：卷积核参数在输入特征数据的列方向上的滑动步长，空洞卷积核的高度以及填充参数。In some embodiments, the auxiliary loading information includes at least any one of the following: a sliding stride of the convolution kernel parameters in the column direction of the input feature data, the height of a dilated (atrous) convolution kernel, and a padding parameter.
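Of the auxiliary loading information above, the effective height of a dilated (atrous) convolution kernel follows a standard formula; the sketch below is illustrative only and is not taken from the specification.

```python
def dilated_kernel_height(k: int, dilation: int) -> int:
    """Effective height (rows of input spanned) of a k-tall
    convolution kernel with the given dilation rate."""
    return dilation * (k - 1) + 1

assert dilated_kernel_height(3, 1) == 3  # ordinary 3x3 kernel
assert dilated_kernel_height(3, 2) == 5  # dilation-2 kernel spans 5 rows
```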
在一些实施例中，所述脉动处理阵列包括多个阵列块，每个阵列块分别用于处理一组输入特征数据；所述将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元，包括：通过所述多个第一装载单元中的每个第一装载单元分别向至少一个阵列块装载所述输入特征数据。In some embodiments, the systolic processing array includes a plurality of array blocks, each array block being used to process one group of input feature data; and loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading, through each of the plurality of first loading units, the input feature data into at least one array block.
在一些实施例中，所述多个第一装载单元中的每个第一装载单元包括缓存子单元，所述缓存子单元的深度与所述第一装载单元装载的阵列块的深度相等；所述对读取的所述输入特征数据进行缓存，包括：通过所述缓存子单元，对从所述存储单元读取的输入特征数据进行缓存。In some embodiments, each of the plurality of first loading units includes a cache subunit, the depth of the cache subunit being equal to the depth of the array block loaded by that first loading unit; and caching the read input feature data includes: caching, through the cache subunit, the input feature data read from the storage unit.
在一些实施例中,各个阵列块的高度均相等。In some embodiments, the heights of each array block are all equal.
在一些实施例中,一个阵列块的尺寸等于所述脉动处理阵列中所装载的一个卷积核参数的尺寸。In some embodiments, the size of an array block is equal to the size of a convolution kernel parameter loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的尺寸根据所述脉动处理阵列中所装载的卷积核参数的尺寸而确定。In some embodiments, the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的尺寸为所述脉动处理阵列中所装载的一个卷积核参数的尺寸的整数倍。In some embodiments, the size of the systolic processing array is an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的列数为3的整数倍。In some embodiments, the number of columns of the systolic processing array is an integer multiple of three.
在一些实施例中，所述多个第一装载单元向所述脉动处理阵列的第u行处理单元装载数据的时刻比所述多个第一装载单元向所述脉动处理阵列的第u+1行处理单元装载数据的时刻早一个时钟周期，u为正整数。In some embodiments, the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, where u is a positive integer.
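The one-cycle skew between successive rows described above means row u receives its data exactly u-1 cycles after row 1. A trivial sketch of this schedule, with assumed names:

```python
def row_load_cycle(t0: int, u: int) -> int:
    """Clock cycle at which row u (1-indexed) of the systolic array is
    loaded, given that row 1 is loaded at cycle t0 and each subsequent
    row is loaded one clock cycle later than the row above it."""
    return t0 + (u - 1)

assert row_load_cycle(0, 1) == 0
assert row_load_cycle(0, 4) == 3  # row 4 loads three cycles after row 1
```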
上述方法实施例中第一装载单元的具体实施例与前述数据处理装置中第一装载单元602的实施例相同,此处不再赘述。The specific embodiment of the first loading unit in the above method embodiment is the same as the embodiment of the first loading unit 602 in the foregoing data processing apparatus, and details are not described herein again.
本公开实施例还包括一种数据处理装置，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现任一实施例所述的方法中由任一第一装载单元所执行的步骤。上述数据处理装置可以是一个数据处理芯片，例如，系统级芯片（System on Chip，SoC）。Embodiments of the present disclosure further include a data processing apparatus including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps performed by any first loading unit in the method of any embodiment. The above data processing apparatus may be a data processing chip, for example, a system on chip (SoC).
本说明书实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现前述任一实施例所述的方法。The embodiments of the present specification further provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in any of the foregoing embodiments.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存（PRAM）、静态随机存取存储器（SRAM）、动态随机存取存储器（DRAM）、其他类型的随机存取存储器（RAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、快闪记忆体或其他内存技术、只读光盘只读存储器（CD-ROM）、数字多功能光盘（DVD）或其他光学存储、磁盒式磁带，磁带磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体（transitory media），如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
通过以上的实施方式的描述可知，本领域的技术人员可以清楚地了解到本说明书实施例可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解，本说明书实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本说明书各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of this specification may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of this specification or in certain parts of the embodiments.
上述实施例阐明的系统、装置、模块、单元或神经网络加速器，具体可以由计算机芯片或实体实现，或者由具有某种功能的产品来实现。一种典型的实现设备为计算机，计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任意几种设备的组合。The systems, apparatuses, modules, units, or neural network accelerators described in the above embodiments may be specifically implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
以上实施例中的各种技术特征可以任意进行组合，只要特征之间的组合不存在冲突或矛盾，但是限于篇幅，未进行一一描述，因此上述实施方式中的各种技术特征的任意进行组合也属于本公开的范围。The various technical features in the above embodiments may be combined arbitrarily, as long as there is no conflict or contradiction between the combined features; due to space limitations, these combinations are not described one by one, but any combination of the various technical features in the above embodiments also falls within the scope of the present disclosure.
本领域技术人员在考虑公开及实践这里公开的说明书后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the disclosure and practice of the specification disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field not disclosed by this disclosure . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
以上所述仅为本公开的较佳实施例,并不用以限制本公开,凡在本公开的精神和原则之内所做的任何修改、等同替换、改进等,均应包含在本公开保护的范围之内。The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present disclosure shall be included in the protection of the present disclosure. within the range.

Claims (75)

  1. A data processing apparatus for loading input feature data into a systolic processing array, wherein the apparatus comprises:
    a plurality of first loading units, each of the plurality of first loading units being configured to:
    access a storage unit in parallel, to read input feature data from the storage unit;
    cache the read input feature data; and
    load the cached input feature data into at least one row of processing units in the systolic processing array.
  2. The data processing apparatus according to claim 1, wherein a sum of the rates at which the plurality of first loading units cache valid data in the input feature data is greater than or equal to a loading rate of the systolic processing array.
  3. The data processing apparatus according to claim 2, wherein, in each clock cycle, the sum of the caching rates of the input feature data is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or
    a sum of average caching rates of the input feature data is greater than or equal to an average loading rate of the systolic processing array; or
    the sum of the average caching rates of the input feature data is greater than or equal to a maximum loading rate of the systolic processing array.
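The rate condition of claims 2-3 can be illustrated with a minimal simulation sketch. All names (`array_never_starves`, the per-cycle rate lists) are hypothetical and chosen only for illustration; the sketch assumes one fixed consumption rate and integer data counts per clock cycle:

```python
# Hypothetical sketch of the buffering-rate condition in claims 2-3:
# the combined rate at which the first loading units cache valid data
# must not fall below the rate at which the systolic array consumes it,
# otherwise the array would stall waiting for input.

def array_never_starves(cache_rates_per_cycle, load_rate_per_cycle, cycles):
    """Simulate `cycles` clock ticks; return True if the buffered data
    never runs out while the systolic array loads at a fixed rate."""
    buffered = 0
    for _ in range(cycles):
        buffered += sum(cache_rates_per_cycle)   # all loaders cache in parallel
        if buffered < load_rate_per_cycle:       # array would underflow here
            return False
        buffered -= load_rate_per_cycle
    return True

# Four loaders each caching 2 values/cycle feed an array loading 8/cycle:
ok = array_never_starves([2, 2, 2, 2], 8, cycles=100)    # sum equals load rate
starved = array_never_starves([2, 2, 2], 8, cycles=100)  # sum below load rate
```

With the sum of caching rates equal to the loading rate the array is kept fed; with the sum below it, the simulated array underflows.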
  4. The data processing apparatus according to claim 1, wherein the plurality of first loading units are further configured to:
    perform first padding on the input feature data in the process of loading the input feature data.
  5. The data processing apparatus according to claim 4, wherein the first padding comprises padding at least one of a left boundary and a right boundary of the input feature data.
  6. The data processing apparatus according to claim 1, wherein the first loading unit is capable of reading one data block of the input feature data from the storage unit at a time, and, when caching of the data block is completed, is capable of loading the data block into at least one row of processing units in the systolic processing array.
  7. The data processing apparatus according to claim 6, wherein the number of columns or rows of the data block is equal to the number of data elements corresponding to a minimum access unit of the storage unit.
  8. The data processing apparatus according to claim 1, wherein the first loading unit comprises:
    a sending subunit, configured to send a read instruction to the storage unit;
    a cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction; and
    a loading subunit, configured to load the cached input feature data into at least one row of processing units in the systolic processing array.
  9. The data processing apparatus according to claim 8, wherein a height of the cache subunit is equal to a height of the systolic processing array.
  10. The data processing apparatus according to claim 9, wherein, in a case in which the systolic processing array comprises a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to a height of the corresponding array block.
  11. The data processing apparatus according to claim 8, wherein there are a plurality of cache subunits, and the plurality of cache subunits are capable of sequentially caching the input feature data returned according to the read instruction;
    the loading subunit is configured to sequentially load the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
  12. The data processing apparatus according to claim 11, wherein each of the plurality of cache subunits is capable of starting to cache the input feature data corresponding to the current cache subunit once caching of the input feature data corresponding to the previous cache subunit is completed.
  13. The data processing apparatus according to claim 8, wherein the cache subunit comprises a plurality of cache blocks, each of the plurality of cache blocks being configured to cache the input feature data required by one row of processing units in the systolic processing array, and the loading subunit is configured to load the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array;
    wherein, after loading of the input feature data cached in the v-th cache block is completed, a first loading instruction can be sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
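The overlap of caching and loading across multiple cache subunits (claims 11-12) resembles a ping-pong (double-buffering) scheme. The sketch below is illustrative only: the list-based "caches", the function name, and the block labels are assumptions, not the claimed hardware:

```python
# Illustrative ping-pong buffering in the spirit of claims 11-12: while
# one cache subunit is drained into the systolic array, the next subunit
# caches the data returned for the following read, so caching and
# loading overlap instead of alternating.

def pingpong_load(data_blocks, num_caches=2):
    """Cache each incoming block in the next subunit in round-robin order;
    return the per-subunit history and the order blocks reach the array."""
    caches = [[] for _ in range(num_caches)]
    loaded = []
    for i, block in enumerate(data_blocks):
        c = i % num_caches        # the next subunit starts caching as soon
        caches[c].append(block)   # as the previous one has finished
        loaded.append(block)      # drain subunit c into the array
    return caches, loaded

caches, loaded = pingpong_load(["b0", "b1", "b2", "b3"])
```

With two subunits, even-numbered blocks pass through the first subunit and odd-numbered blocks through the second, while the loading order of the blocks is preserved.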
  14. The data processing apparatus according to claim 1, wherein each of the plurality of first loading units is further configured to:
    after loading of the input feature data corresponding to that first loading unit is completed, send a second loading instruction to the next first loading unit, so as to trigger the next first loading unit to load, into the systolic processing array, the input feature data corresponding to the next first loading unit.
  15. The data processing apparatus according to claim 8, wherein the sending subunit comprises:
    a first parsing subunit, configured to receive a loading instruction and parse the loading instruction to generate description information of the input feature data to be loaded; and
    a second parsing subunit, configured to parse the description information and send the read instruction to the storage unit according to a parsing result.
  16. The data processing apparatus according to claim 15, wherein the description information comprises at least any one of the following:
    the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  17. The data processing apparatus according to claim 15, wherein the second parsing subunit is further configured to:
    generate auxiliary loading information, the auxiliary loading information being used to determine a loading manner of the input feature data to be loaded;
    the cache subunit comprises:
    a first cache subunit, configured to cache the auxiliary loading information; and
    a second cache subunit, configured to read the auxiliary loading information from the first cache subunit, and cache, according to the auxiliary loading information, the input feature data returned according to the read instruction.
  18. The data processing apparatus according to claim 17, wherein the second cache subunit comprises:
    a third cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction;
    a read-write subunit, configured to read the auxiliary loading information from the first cache subunit, rearrange, according to the auxiliary loading information, the input feature data cached in the third cache subunit, and write the rearranged input feature data into a fourth cache subunit; and
    the fourth cache subunit, configured to cache the rearranged input feature data for the loading subunit to load the rearranged input feature data into the systolic processing array.
  19. The data processing apparatus according to claim 18, wherein there are a plurality of fourth cache subunits, and the plurality of fourth cache subunits are capable of sequentially caching the rearranged input feature data;
    the loading subunit is configured to sequentially load the input feature data in each of the plurality of fourth cache subunits into the systolic processing array.
  20. The data processing apparatus according to claim 19, wherein each of the plurality of fourth cache subunits is capable of starting to cache the rearranged input feature data once caching by the previous fourth cache subunit is completed.
  21. The data processing apparatus according to claim 19, wherein each of the plurality of fourth cache subunits comprises a plurality of fifth cache subunits, a height of each of the plurality of fifth cache subunits included in the fourth cache subunit being equal to a height of the fourth cache subunit, and a width of each fifth cache subunit being equal to a bit width of the input feature data read from the storage unit.
  22. The data processing apparatus according to claim 18, wherein the read-write subunit is further configured to:
    perform second padding on the input feature data cached in the third cache subunit in the process of rearranging the input feature data cached in the third cache subunit.
  23. The data processing apparatus according to claim 22, wherein the second padding comprises padding at least one of an upper boundary and a lower boundary of the input feature data cached in the third cache subunit.
  24. The data processing apparatus according to claim 17, wherein the auxiliary loading information comprises at least any one of the following:
    a sliding stride of the convolution kernel parameters in the column direction of the input feature data, a height of a dilated convolution kernel, and a padding parameter.
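The two padding steps referenced in claims 4-5 and 22-23 (first padding on the left/right boundaries during loading, second padding on the upper/lower boundaries during rearrangement) can be sketched as follows. Zero-valued padding and the helper names are assumptions for illustration; the claims do not fix the padding value:

```python
# Minimal sketch of the two padding steps: the first fill pads the
# left/right boundaries of the input feature map, the second pads the
# top/bottom boundaries. Zero padding is assumed here.

def pad_left_right(feature_map, left, right):
    """First padding: extend each row at its left and right boundaries."""
    return [[0] * left + row + [0] * right for row in feature_map]

def pad_top_bottom(feature_map, top, bottom):
    """Second padding: add all-zero rows above and below the map."""
    width = len(feature_map[0])
    return ([[0] * width for _ in range(top)] + feature_map
            + [[0] * width for _ in range(bottom)])

fm = [[1, 2], [3, 4]]
padded = pad_top_bottom(pad_left_right(fm, 1, 1), 1, 1)
# `padded` is a 4x4 map with the original 2x2 block in the centre.
```

Applying both fills with a padding width of 1 surrounds the 2x2 example map with a one-element zero border, as a same-size 3x3 convolution would require.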
  25. The data processing apparatus according to claim 1, wherein the systolic processing array comprises a plurality of array blocks, each array block being configured to process one group of input feature data, and each of the plurality of first loading units being configured to load the input feature data into at least one array block.
  26. The data processing apparatus according to claim 25, wherein each of the plurality of first loading units comprises a cache subunit configured to cache the input feature data read from the storage unit, a depth of the cache subunit being equal to a depth of the array block loaded by the first loading unit.
  27. The data processing apparatus according to claim 25, wherein the array blocks are all equal in height.
  28. The data processing apparatus according to claim 27, wherein a size of one array block is equal to a size of one set of convolution kernel parameters loaded in the systolic processing array.
  29. The data processing apparatus according to any one of claims 1 to 28, wherein a size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  30. The data processing apparatus according to claim 29, wherein the size of the systolic processing array is an integer multiple of the size of one set of convolution kernel parameters loaded in the systolic processing array.
  31. The data processing apparatus according to any one of claims 1 to 28, wherein the number of columns of the systolic processing array is an integer multiple of 3.
  32. The data processing apparatus according to any one of claims 1 to 28, wherein the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, u being a positive integer.
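The one-cycle skew of claim 32 matches the usual wavefront timing of a systolic array: each successive row receives its data one clock later than the row above it. The following sketch only computes that schedule; the function name and the dictionary representation are illustrative assumptions:

```python
# Sketch of the skewed load schedule in claim 32: data for row u of the
# systolic array is presented one clock cycle before row u+1, so inputs
# arrive aligned with the wavefront propagating through the array.

def load_schedule(num_rows, start_cycle=0):
    """Return {row index: clock cycle at which that row is loaded}."""
    return {u: start_cycle + u for u in range(num_rows)}

sched = load_schedule(4)
# Row 0 is loaded at cycle 0, row 1 at cycle 1, and so on, one cycle apart.
```

Consecutive rows in the returned schedule always differ by exactly one cycle, which is the property the claim states.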
  33. A data processing system, comprising the apparatus according to any one of claims 1 to 32; and
    a systolic processing array, configured to load the input feature data and convolution kernel parameters, and to perform convolution processing on the input feature data and the convolution kernel parameters.
  34. The data processing system according to claim 33, wherein the system further comprises:
    a second loading unit, configured to load the convolution kernel parameters into the systolic processing array.
  35. The data processing system according to claim 33, wherein the system further comprises:
    a storage unit, configured to store the input feature data.
  36. The data processing system according to claim 35, wherein the storage unit comprises:
    a plurality of mutually independent storage subunits, each storage subunit being configured to store part of the input feature data;
    the plurality of first loading units being configured to access different storage subunits at the same time, to acquire the input feature data stored in the corresponding storage subunits.
  37. The data processing system according to claim 36, wherein the storage unit further comprises:
    a scheduling unit, configured to receive access requests from the plurality of first loading units and send the access requests to the corresponding storage subunits, so that the plurality of first loading units access the corresponding storage subunits.
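The banked-memory arrangement of claims 36-37 can be sketched as independent "banks" plus a scheduler that routes each loader's request to its own bank, so that reads proceed in parallel without contention. The round-robin row layout, the request tuples, and all names below are assumptions made purely for illustration:

```python
# Illustrative sketch of claims 36-37: the input feature data is split
# across mutually independent storage subunits (banks), and a scheduler
# forwards each loader's access request to the corresponding bank.

def build_banks(feature_rows, num_banks):
    """Distribute rows of the feature map round-robin over the banks."""
    banks = [[] for _ in range(num_banks)]
    for i, row in enumerate(feature_rows):
        banks[i % num_banks].append(row)
    return banks

def schedule_access(banks, requests):
    """requests: list of (loader_id, bank_id, index). In one cycle each
    loader must target a distinct bank, so all reads can be served at once."""
    assert len({bank_id for _, bank_id, _ in requests}) == len(requests)
    return {loader: banks[bank][idx] for loader, bank, idx in requests}

banks = build_banks([[1, 1], [2, 2], [3, 3], [4, 4]], num_banks=2)
served = schedule_access(banks, [(0, 0, 0), (1, 1, 0)])
```

The assertion inside `schedule_access` encodes the conflict-freedom assumption: two loaders hitting the same bank in the same cycle would have to be serialized, which this sketch does not model.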
  38. The data processing system according to claim 33, wherein the system further comprises:
    an output unit, configured to acquire a processing result output by the systolic processing array, and to store the processing result or output the processing result.
  39. The system according to any one of claims 33 to 38, wherein the system is implemented based on an FPGA or an ASIC.
  40. A neural network accelerator, wherein the neural network accelerator comprises the apparatus according to any one of claims 1 to 32, or comprises the system according to any one of claims 33 to 39.
  41. The neural network accelerator according to claim 40, wherein the neural network accelerator is a CNN accelerator or an RNN accelerator.
  42. A data processing method, applied to each first loading unit in a data processing apparatus comprising a plurality of first loading units, to load input feature data into a systolic processing array, wherein the method comprises:
    accessing a storage unit in parallel through each of the plurality of first loading units, to read the input feature data from the storage unit;
    caching the read input feature data; and
    loading the cached input feature data into at least one row of processing units in the systolic processing array.
  43. The data processing method according to claim 42, wherein a sum of the rates at which the plurality of first loading units cache valid data in the input feature data is greater than or equal to a loading rate of the systolic processing array.
  44. The data processing method according to claim 43, wherein, in each clock cycle, the sum of the caching rates of the input feature data is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or
    a sum of average caching rates of the input feature data is greater than or equal to an average loading rate of the systolic processing array; or
    the sum of the average caching rates of the input feature data is greater than or equal to a maximum loading rate of the systolic processing array.
  45. The data processing method according to claim 42, wherein the method further comprises:
    performing first padding on the input feature data in the process of loading the input feature data.
  46. The data processing method according to claim 45, wherein the performing first padding on the input feature data comprises:
    padding at least one of a left boundary and a right boundary of the input feature data.
  47. The data processing method according to claim 42, wherein the accessing a storage unit in parallel through each of the plurality of first loading units, to read input feature data from the storage unit, comprises:
    reading one data block of the input feature data from the storage unit at a time; and
    the loading the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    when caching of the data block is completed, loading the data block into at least one row of processing units in the systolic processing array.
  48. The data processing method according to claim 47, wherein the number of columns or rows of the data block is equal to the number of data elements corresponding to a minimum access unit of the storage unit.
  49. The data processing method according to claim 42, wherein each of the plurality of first loading units comprises a sending subunit, a cache subunit, and a loading subunit; and the accessing a storage unit in parallel through each of the plurality of first loading units, to read input feature data from the storage unit, comprises:
    sending, through the sending subunit, a read instruction to the storage unit;
    caching, through the cache subunit, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array.
  50. The data processing method according to claim 49, wherein a height of the cache subunit is equal to a height of the systolic processing array.
  51. The data processing method according to claim 50, wherein, in a case in which the systolic processing array comprises a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to a height of the corresponding array block.
  52. The data processing method according to claim 49, wherein there are a plurality of cache subunits;
    the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction; and
    the loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    sequentially loading the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
  53. The data processing method according to claim 52, wherein the sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction comprises:
    when caching of the input feature data corresponding to the previous one of the plurality of cache subunits is completed, starting to cache the input feature data corresponding to the current one of the plurality of cache subunits.
  54. The data processing method according to claim 49, wherein the cache subunit comprises a plurality of cache blocks;
    the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    caching, through each of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and
    the loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    loading, through the loading subunit, the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array; wherein, after loading of the input feature data cached in the v-th cache block is completed, a first loading instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
  55. The data processing method according to claim 42, wherein the accessing a storage unit in parallel through each of the plurality of first loading units, to read the input feature data from the storage unit, comprises:
    after loading of the input feature data corresponding to each of the plurality of first loading units is completed, sending a second loading instruction to the next first loading unit, so as to trigger the next first loading unit to load the input feature data into the systolic processing array.
  56. The data processing method according to claim 49, wherein the sending subunit comprises a first parsing subunit and a second parsing subunit; and
    the sending, through the sending subunit, a read instruction to the storage unit comprises:
    receiving, through the first parsing subunit, a loading instruction, and parsing the loading instruction to generate description information of the input feature data to be loaded; and
    parsing, through the second parsing subunit, the description information, and sending the read instruction to the storage unit according to a parsing result.
  57. The data processing method according to claim 56, wherein the description information comprises at least any one of the following:
    the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  58. The data processing method according to claim 56, wherein the cache subunit comprises a first cache subunit and a second cache subunit; and the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    generating auxiliary loading information, and determining, through the auxiliary loading information, a loading manner of the input feature data to be loaded;
    caching, through the first cache subunit, the auxiliary loading information; and
    reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching, according to the auxiliary loading information, the input feature data returned according to the read instruction.
  59. The data processing method according to claim 58, wherein the reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching the input feature data according to the auxiliary loading information, further comprises:
    caching, through a third cache subunit, the input feature data returned by the storage unit according to the read instruction;
    reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging, according to the auxiliary loading information, the input feature data cached in the third cache subunit, and writing the rearranged input feature data into a fourth cache subunit; and
    caching, through the fourth cache subunit, the rearranged input feature data for the loading subunit to load the rearranged input feature data into the systolic processing array.
  60. The data processing method according to claim 59, wherein there are a plurality of the fourth cache subunits;
    caching, by the fourth cache subunit, the rearranged input feature data comprises:
    caching the rearranged input feature data sequentially by the plurality of fourth cache subunits; and
    loading, by the loading subunit, the rearranged input feature data into the systolic processing array comprises:
    loading, by the loading subunit, the input feature data in each of the plurality of fourth cache subunits into the systolic processing array in sequence.
  61. The data processing method according to claim 60, wherein caching of the rearranged input feature data corresponding to a current fourth cache subunit of the plurality of fourth cache subunits begins after caching of the rearranged input feature data corresponding to the preceding fourth cache subunit is complete.
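Claims 60 and 61 describe the classic ping-pong (multi-buffer) pattern: the fourth cache subunits are filled one after another, and while one buffer is being drained into the systolic array the next can already be filling. A minimal schedule sketch, assuming two buffers and per-tile fill/drain events (the event representation is illustrative, not from the specification):

```python
def pingpong_schedule(num_tiles: int, num_buffers: int = 2):
    """Ping-pong schedule over the fourth cache subunits: tile t is written
    into buffer t % num_buffers; fills are sequential (claim 61), while the
    drain of tile t into the array overlaps the fill of tile t+1."""
    events = []
    for t in range(num_tiles):
        buf = t % num_buffers
        events.append(("fill", t, buf))  # cache rearranged data into buffer
        if t > 0:
            # overlapped: the previously filled buffer drains into the array
            events.append(("drain", t - 1, (t - 1) % num_buffers))
    events.append(("drain", num_tiles - 1, (num_tiles - 1) % num_buffers))
    return events
```

The overlap hides the rearrangement-and-cache latency behind the array's compute time, which is the usual motivation for providing more than one such buffer.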
  62. The data processing method according to claim 60, wherein each of the plurality of fourth cache subunits comprises a plurality of fifth cache subunits, the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  63. The data processing method according to claim 59, wherein the method further comprises:
    performing second padding on the input feature data cached by the third cache subunit in the process of rearranging the input feature data cached by the third cache subunit.
  64. The data processing method according to claim 63, wherein performing the second padding on the input feature data cached by the third cache subunit comprises:
    padding at least one of an upper boundary and a lower boundary of the input feature data cached by the third cache subunit.
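The vertical padding of claim 64 can be sketched on a plain list-of-rows feature map; the padding amounts and fill value here are illustrative parameters, not values recited by the claims:

```python
def pad_vertical(rows, pad_top=0, pad_bottom=0, value=0):
    """Pad the upper and/or lower boundary of a 2-D feature map with rows of
    a constant value (claim 64: second padding on at least one boundary)."""
    width = len(rows[0]) if rows else 0
    zero_row = [value] * width
    return ([list(zero_row) for _ in range(pad_top)]
            + [list(r) for r in rows]
            + [list(zero_row) for _ in range(pad_bottom)])
```

Applying the padding during the rearrangement step, as claim 63 describes, avoids a separate pass over the data: the padded rows are simply emitted alongside the rearranged ones.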
  65. The data processing method according to claim 58, wherein the auxiliary loading information comprises at least any one of the following:
    the sliding stride of the convolution kernel parameters in the column direction of the input feature data, the height of the dilated convolution kernel, and the padding parameter.
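The "height of the dilated convolution kernel" in claim 65 follows from the nominal kernel height and the dilation rate by the standard dilated-convolution relation (the formula is conventional, not recited in the claim itself):

```python
def dilated_kernel_height(kernel_h: int, dilation: int) -> int:
    """Effective height of a dilated (atrous) convolution kernel:
    d*(k-1)+1 rows are spanned when (d-1) gaps separate adjacent taps."""
    return dilation * (kernel_h - 1) + 1
```

With dilation 1 this reduces to the plain kernel height, so carrying the dilated height in the auxiliary loading information covers ordinary convolution as a special case.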
  66. The data processing method according to claim 42, wherein the systolic processing array comprises a plurality of array blocks, and each array block is used to process one group of input feature data;
    loading the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    loading, by each of the plurality of first loading units, the input feature data into at least one array block.
  67. The data processing method according to claim 66, wherein each of the plurality of first loading units comprises a cache subunit, and the depth of the cache subunit is equal to the depth of the array block loaded by the first loading unit;
    caching the read input feature data comprises:
    caching, by the cache subunit, the input feature data read from the storage unit.
  68. The data processing method according to claim 66, wherein the array blocks are all equal in height.
  69. The data processing method according to claim 68, wherein the size of one array block is equal to the size of one convolution kernel parameter loaded in the systolic processing array.
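Claims 68 and 69 partition the systolic array into equal-height blocks, each sized to one convolution kernel parameter, so the block boundaries fall out directly from the kernel height. A sketch under the assumption (consistent with claim 71) that the array height is an exact multiple of the kernel height:

```python
def partition_array(array_rows: int, kernel_h: int):
    """Split a systolic array of array_rows rows into equal array blocks of
    kernel_h rows each (claims 68-69); returns each block's starting row.
    Assumes array_rows is an integer multiple of kernel_h (claim 71)."""
    if array_rows % kernel_h != 0:
        raise ValueError("array height must be a multiple of the kernel height")
    return [b * kernel_h for b in range(array_rows // kernel_h)]
```

Each first loading unit can then be associated with one or more of these block start rows, matching claim 66's one-block-per-group mapping.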
  70. The data processing method according to any one of claims 42 to 69, wherein the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  71. The data processing method according to claim 70, wherein the size of the systolic processing array is an integer multiple of the size of one convolution kernel parameter loaded in the systolic processing array.
  72. The data processing method according to any one of claims 42 to 69, wherein the number of columns of the systolic processing array is an integer multiple of 3.
  73. The data processing method according to any one of claims 42 to 69, wherein the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, where u is a positive integer.
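The one-cycle stagger of claim 73 is the classic input skew of a systolic array: row u receives its first datum one clock cycle before row u+1, so the data wavefront advances diagonally through the array. A minimal schedule sketch (the cycle numbering origin is illustrative):

```python
def load_schedule(num_rows: int, start_cycle: int = 0):
    """Cycle at which each row of the systolic array receives its first
    datum: row u is loaded one clock cycle before row u+1 (claim 73)."""
    return {u: start_cycle + (u - 1) for u in range(1, num_rows + 1)}
```

The skew ensures that partial sums travelling down the array meet the input operands of each row at the right cycle, without any row-level handshaking.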
  74. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 42 to 73.
  75. A data processing apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any first loading unit in the method according to any one of claims 42 to 73.
PCT/CN2020/106556 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator WO2022027172A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/106556 WO2022027172A1 (en) 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator

Publications (1)

Publication Number Publication Date
WO2022027172A1 true WO2022027172A1 (en) 2022-02-10

Family

ID=80119809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106556 WO2022027172A1 (en) 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator

Country Status (1)

Country Link
WO (1) WO2022027172A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600652A (en) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) Convolutional neural network processing device, high-speed target detection method and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103316A1 (en) * 2015-05-21 2017-04-13 Google Inc. Computing convolutions using a neural network processor
CN109416754A (en) * 2016-05-26 2019-03-01 多伦多大学管理委员会 Accelerator for deep neural network
CN110333827A (en) * 2019-07-11 2019-10-15 山东浪潮人工智能研究院有限公司 A kind of data loading device and data load method
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
US10713214B1 (en) * 2017-09-27 2020-07-14 Habana Labs Ltd. Hardware accelerator for outer-product matrix multiplication

Similar Documents

Publication Publication Date Title
US11593594B2 (en) Data processing method and apparatus for convolutional neural network
WO2018196863A1 (en) Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
JP2019036298A (en) Intelligent high bandwidth memory system and logic dies therefor
CN104615488A (en) Task scheduling method and device on heterogeneous multi-core reconfigurable computing platform
TW200402653A (en) Shared memory controller for display processor
TW202207031A (en) Load balancing for memory channel controllers
WO2022027172A1 (en) Data processing apparatus, method, and system, and neural network accelerator
US11467973B1 (en) Fine-grained access memory controller
US8386687B2 (en) Method and apparatus for data transfer
KR20210151250A (en) extended memory interface
US9575759B2 (en) Memory system and electronic device including memory system
CN111694513A (en) Memory device and method including a circular instruction memory queue
CN112433847B (en) OpenCL kernel submitting method and device
CN110633226A (en) Fusion memory, storage system and deep learning calculation method
CN101002272A (en) Addressing data within dynamic random access memory
JP2024516514A (en) Memory mapping of activations for implementing convolutional neural networks
TW202213127A (en) Graphics processor and acceleration method thereof
CN116360672A (en) Method and device for accessing memory and electronic equipment
CN112035056A (en) Parallel RAM access architecture and access method based on multiple computing units
US11734551B2 (en) Data storage method for speech-related DNN operations
CN112639747A (en) Addressing method of processor, movable platform and electronic equipment
TWI819428B (en) Processor apparatus
US11983128B1 (en) Multidimensional and multiblock tensorized direct memory access descriptors
US20030208671A1 (en) Data flow processor
CN111832714A (en) Operation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20948463

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20948463

Country of ref document: EP

Kind code of ref document: A1