WO2022027172A1 - Data processing apparatus, method, and system, and neural network accelerator - Google Patents


Info

Publication number
WO2022027172A1
Authority
WO
WIPO (PCT)
Prior art keywords
input feature
data
cache
feature data
loading
Prior art date
Application number
PCT/CN2020/106556
Other languages
French (fr)
Chinese (zh)
Inventor
李鹏
韩峰
杨康
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/106556 priority Critical patent/WO2022027172A1/en
Publication of WO2022027172A1 publication Critical patent/WO2022027172A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Definitions

  • The present disclosure relates to the technical field of artificial intelligence, and in particular, to a data processing apparatus, method, and system, and a neural network accelerator.
  • Neural networks often involve a large amount of convolution processing.
  • For convolution processing, it is necessary to load the input feature data and the convolution kernel parameters into the systolic processing array of the neural network, and then use the systolic processing array to perform calculations on the input feature data and the convolution kernel parameters to obtain the output feature data.
  • When insufficient input feature data is supplied, processing units in the systolic processing array are idle, so that the processing efficiency of the systolic processing array is low.
  • The embodiments of the present disclosure propose a data processing apparatus, method, and system, and a neural network accelerator, to solve the technical problem of low processing efficiency of a systolic processing array in the related art.
  • A data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first loading units, each of which is used to: access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
  • A data processing system is provided, comprising the data processing apparatus described in any one of the embodiments, and a systolic processing array for loading the input feature data and convolution kernel parameters and for performing convolution processing on the input feature data and the convolution kernel parameters.
  • a neural network accelerator where the neural network accelerator includes the data processing apparatus described in any embodiment, or includes the data processing system described in any embodiment.
  • A data processing method is provided, applied to each first loading unit in a data processing apparatus that includes a plurality of first loading units, to load input feature data into a systolic processing array. The method includes: accessing a storage unit in parallel by each first loading unit in the plurality of first loading units; reading the input feature data from the storage unit; buffering the read input feature data; and loading the buffered input feature data into at least one row of processing units in the systolic processing array.
  • A computer-readable storage medium is provided on which a computer program is stored, which, when executed by a processor, implements the method described in any of the embodiments of the present disclosure.
  • A data processing apparatus is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any of the first loading units in the method of any of the embodiments of the present disclosure.
  • Since a plurality of first loading units acquire the input feature data from the storage unit in parallel, the amount of valid data acquired by the systolic processing array is multiplied compared with acquiring data through only one loading unit. The idle time of the processing units in the systolic processing array is therefore reduced, and the processing efficiency of the systolic processing array is improved.
  • FIG. 1 is a schematic diagram of a processing manner of a systolic processing array according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of a data loading manner according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of a data flow process in a systolic processing array according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a data read and load process according to an embodiment of the present disclosure.
  • FIGS. 5A and 5B are schematic diagrams of valid data read from a memory cell according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 7A and 7B are schematic diagrams illustrating changes in the amount of cached data during a data loading process according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a first loading unit according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of instantiating multiple cache units according to an embodiment of the present disclosure.
  • FIG. 10 is a processing flowchart of a data processing apparatus according to an embodiment of the present disclosure.
  • FIG. 11A is a schematic diagram of the length of a conventional systolic processing array.
  • FIG. 11B is a schematic diagram of the length of a systolic processing array in accordance with an embodiment of the present disclosure.
  • FIG. 12 is a schematic diagram of a computer device according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of a memory cell according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of a data processing system according to an embodiment of the present disclosure.
  • FIGS. 15A and 15B are respectively schematic diagrams of neural network accelerators according to embodiments of the present disclosure.
  • FIG. 16 is a flowchart of a data processing method according to an embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other.
  • For example, without departing from the scope of the present disclosure, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information.
  • The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to determining".
  • A data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first loading units; each first loading unit of the plurality of first loading units can access a storage unit in parallel to read input feature data from the storage unit, buffer the read input feature data, and load the buffered input feature data into at least one row of processing units in the systolic processing array.
  • The input feature data here can be an input feature map (Input Feature Map, IFM), which can come from an image, voice, or text; correspondingly, the output feature data is an output feature map (Output Feature Map, OFM), which can be converted into an image, voice, or text.
  • The systolic processing array is a simple and efficient processing device.
  • The input feature data is multiplexed in the systolic processing array, which can reduce the input bandwidth requirement for the input feature data.
  • A common systolic processing array is rectangular, including R × H processing units: each row has H processing units and each column has R processing units, where R and H may or may not be equal.
  • The processing unit in the i-th row and j-th column is used to pass the input feature data obtained in the current clock cycle to the processing unit in the i-th row and (j+1)-th column in the next clock cycle, and to send on its operation result.
  • The processing unit does not necessarily process every input feature data it acquires.
  • For example, after the processing unit in row 1, column 1 transmits its input feature data to the processing unit in row 1, column 2, the processing unit in row 1, column 2 may not multiply that input feature data with the convolution kernel parameter it holds, but instead pass it directly to the processing unit in row 1, column 3.
  • Each processing unit can be used to perform multiplication and addition operations on the input feature data loaded into the processing unit and the convolution kernel parameters, and output the operation result to the next processing unit, and the operation result of the last row of processing units is used as the output feature data.
  • One input feature data and one convolution kernel parameter can be loaded into one processing unit, or one processing unit can be loaded with an input feature data block of size M × N and a convolution kernel parameter block of size K × L, where R, H, M, N, K, and L are all positive integers.
  • The first loading unit is used to load the input feature data into the corresponding processing units. The input feature data can be stored in a storage unit, which may be a static random-access memory (Static Random-Access Memory, SRAM) or another type of storage unit.
  • Because the processing result of a processing unit is passed down and accumulated with the product computed by the processing unit in the next row, the input feature data of each row of processing units is delayed by one clock cycle row by row. When the product of a row's input feature data and its convolution kernel parameters has been calculated, it is accumulated exactly with the product sent by the previous row of processing units. If the input of the input feature data is stopped, the computation of at least one processing unit in the systolic processing array at the corresponding moment also stops.
  • The second loading unit is used to load the convolution kernel parameters into the corresponding processing units. The convolution kernel parameters may be stored in a storage unit, and the storage unit for storing the convolution kernel parameters and the storage unit for storing the input feature data may be the same storage unit or different storage units.
  • the convolution kernel parameters in the processing unit will remain unchanged, and the convolution kernel parameters will be reused for different input feature data flowing into the processing unit.
  • The loading order of the convolution kernel parameters and the input feature data is not limited here.
  • The convolution kernel parameters can be loaded earlier than, later than, or at the same time as the input feature data.
  • Different processing units can load convolution kernel parameters at the same time, or sequentially in a certain order; for example, they can be loaded into the systolic processing array in the same way as the input feature data.
  • The output unit is used for buffering the output data of the processing units, or for sending the processing results of the processing units (i.e., the output feature data) to other processing units for further processing or storage.
  • The first loading unit can load one input feature map into the systolic processing array at a time, or multiple input feature maps can be loaded into the systolic processing array at a time and processed in the array.
  • the second loading unit can load one set of convolution kernel parameters into the systolic processing array at a time, or can load multiple sets of convolution kernel parameters into the systolic processing array.
  • For example, the second loading unit can load four groups of convolution kernel parameters into the systolic processing array; the different colored squares in the figure represent different convolution kernel parameters.
  • For example, the first loading unit can input (row_x, row_y, row_z) of the first input feature map to the first three rows of processing units of the systolic processing array, and input (row_x, row_y, row_z) of the second input feature map to the next three rows of processing units of the systolic processing array.
  • The first three rows of processing units are located before the next three rows of processing units in the processing sequence. In this way, two input feature maps can be processed simultaneously.
  • The maximum demand rate of the systolic processing array for input feature data is H per clock cycle; that is, in each clock cycle, one input feature data can be input to each row of the systolic processing array.
  • For example, suppose the input feature data of the i-th row and the j-th column is called aij, and H is equal to 5. As shown in FIG. 3, in the first clock cycle, a11 is input to M11.
  • In the second clock cycle, a11 is transferred from M11 rightward to M12, new data a12 is input into M11, and a21 is input into M21 at the same time.
  • In the third clock cycle, a11 is passed rightward to M13, a12 is passed rightward to M12, new data a13 is input into M11, a21 is passed rightward to M22, new data a22 is input into M21, and new data a31 is input into M31.
  • the data in the gray box in the figure is the new data flowing into the systolic processing array every clock cycle.
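As an illustrative sketch (not part of the disclosure), the skewed input timing above can be modeled with two small helpers; the names `entry_cycle` and `arrival_cycle` and the 1-based indexing are assumptions made for illustration:

```python
# Sketch of the skewed ("wavefront") input timing described above.
# Assumption: each row starts one clock cycle after the previous row,
# and data moves one column to the right per cycle, as in FIG. 3.

def entry_cycle(i, k):
    """Clock cycle at which the k-th datum of row i (a_ik) enters M_i1."""
    # Row i starts one cycle after row i-1; within a row, one datum per cycle.
    return i + (k - 1)

def arrival_cycle(i, k, j):
    """Clock cycle at which a_ik reaches the processing unit M_ij."""
    # Moving right takes one cycle per column.
    return entry_cycle(i, k) + (j - 1)

# Reproduce the example from the text (H = 5):
assert entry_cycle(1, 1) == 1       # cycle 1: a11 enters M11
assert arrival_cycle(1, 1, 2) == 2  # cycle 2: a11 is in M12
assert entry_cycle(2, 1) == 2       # cycle 2: a21 enters M21
assert arrival_cycle(1, 1, 3) == 3  # cycle 3: a11 is in M13
assert entry_cycle(3, 1) == 3       # cycle 3: a31 enters M31
```

The model makes it easy to see why each row's input is delayed by exactly one cycle relative to the row above it.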
  • input feature data may be stored in a storage unit, read from a data output port of the storage unit by a first loading unit, and loaded into a systolic processing array.
  • A portion of the input feature data of the input feature map read every clock cycle from a data output port is loaded into a row of the systolic processing array.
  • The minimum access unit of on-chip SRAM (SRAM On Chip) is generally 2^n bytes, where n is an integer; that is to say, at least 2^n bytes of data are accessed each time.
  • The bit width of the data output port of the SRAM is equal to its minimum access unit; that is, the input feature data is accessed in the SRAM according to the minimum access unit of the SRAM.
  • The technical solutions of the embodiments of the present disclosure are described below taking the length of each data as 1 byte (i.e., 8 bits) as an example. In practical applications, the length of a data may also be greater than 1 byte; the situation when the length is another value is similar to the situation when the data length is 1 byte, and details are not repeated in this disclosure. Now consider the following cases:
  • Case 1: The amount of data corresponding to the number of columns of the input feature map stored in the SRAM is greater than the amount of data corresponding to the minimum access unit of the SRAM, and is not an integer multiple of the minimum access unit. The amount of data the data processing apparatus accesses from the SRAM in one clock cycle is then less than the data amount of one row of the input feature map stored in the SRAM, and after access operations are performed on one row of the input feature map, the efficiency of the access operation corresponding to the data at the tail of that row is reduced, as shown in FIG. 5A.
  • Case 2 The number of columns of the input feature map stored in the SRAM is relatively small, and one minimum access unit can correspond to multiple rows of input feature data. When only one row of input feature data needs to be accessed from the SRAM, although multiple rows of data can be accessed in one access operation, only one row of data is valid data. As shown in FIG. 5B , the white rectangles in the figure represent the input feature data on the 1st, 3rd, and 5th rows, and the gray rectangles represent the input feature data on the 2nd, 4th, and 6th rows.
  • For example, suppose the number of columns of a row of input feature data is 32 bytes (each input feature data in the row being 1 byte) and the minimum access unit is 64 bytes. Then 64 bytes of data can be taken out when accessing the SRAM in each clock cycle (for example, the 1st row of data together with the 2nd row of data). However, since the systolic processing array is loaded with one row of the input feature data at a time, only the 1st row of data or the 2nd row of data is valid data; the valid data is therefore only 32 bytes.
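The loss of access efficiency in these two cases can be quantified with a small helper. This is an illustrative sketch, not part of the disclosure; the function names and the 96-byte row in the second example are assumptions:

```python
import math

def access_efficiency(valid_bytes, min_access_unit):
    """Fraction of one SRAM access that is valid data; one access
    always fetches exactly min_access_unit bytes."""
    return valid_bytes / min_access_unit

# Case 2 example from the text: a 32-byte row fetched with a 64-byte
# minimum access unit -- only one of the two fetched rows is needed.
assert access_efficiency(32, 64) == 0.5

def row_access_efficiency(row_bytes, min_access_unit):
    """Efficiency of reading one full row when the row is not an
    integer multiple of the minimum access unit (Case 1 flavour)."""
    accesses = math.ceil(row_bytes / min_access_unit)
    return row_bytes / (accesses * min_access_unit)

# Assumed Case-1 example: a 96-byte row with a 64-byte access unit
# needs two accesses (128 bytes fetched) for 96 valid bytes.
assert row_access_efficiency(96, 64) == 0.75
```

When the row length is an exact multiple of the minimum access unit, `row_access_efficiency` returns 1.0, which is why the inefficiency only appears in the two cases described above.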
  • If the data volume corresponding to the height H of the systolic processing array is exactly equal to the data volume corresponding to the minimum access unit, the valid data provided in odd-numbered clock cycles is less than the maximum data consumption rate of the systolic processing array, so the systolic processing array cannot obtain enough valid data in even-numbered clock cycles.
  • In other words, the data provided per clock cycle cannot meet the maximum data consumption rate of the systolic processing array; that is, the systolic processing array cannot obtain enough valid data per clock cycle.
  • When the height H of the systolic processing array is large (greater than the minimum access unit), the above phenomenon is more obvious.
  • an embodiment of the present disclosure provides a data processing apparatus for loading input feature data into a systolic processing array.
  • the apparatus may include:
  • a plurality of first loading units 602, each first loading unit of the plurality of first loading units 602 being used to: access a storage unit in parallel to read input feature data from the storage unit, buffer the read input feature data, and load the buffered input feature data into at least one row of processing units in a systolic processing array.
  • The size of the systolic processing array may be determined according to the size of the convolution kernel parameters loaded in the systolic processing array; that is, a suitable size is selected according to the size of the convolution kernel parameters to be loaded.
  • In order to improve the utilization of processing units in the systolic processing array, the size of the systolic processing array may be an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array. For example, if the convolution kernel parameters loaded in the systolic processing array are all of size K × L, the number of rows of the systolic processing array is an integer multiple of K and the number of columns is an integer multiple of L.
  • the systolic processing array can be divided into several array blocks according to the dimension of "row", each array block includes one or more rows of processing units, and the heights of each array block can be equal.
  • the size of one array block is equal to the size of one convolution kernel parameter loaded in the systolic processing array. That is to say, one array block corresponds to one convolution kernel parameter.
  • Alternatively, the size of one array block may be equal to the sum of the sizes of multiple convolution kernel parameters loaded in the systolic processing array.
  • For example, as shown in the figure, every 3 rows of processing units in the systolic processing array are regarded as one array block, so that the systolic processing array is divided into two array blocks, array block 1 and array block 2.
  • each small square represents a processing unit.
  • Each first loading unit can load input feature data into one array block in the systolic processing array, ie, each array block uses one first loading unit independently.
  • the size of the input feature data loaded into one array block by the first loading unit in each clock cycle is equal to the size of the data corresponding to the height of one array block.
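As a sketch of the row-wise division into array blocks described above (illustrative only; the function name and the 0-based row indices are assumptions, and the 6-row example follows the 3-rows-per-block figure):

```python
def divide_into_array_blocks(num_rows, block_height):
    """Split the rows of a systolic processing array into equal-height
    array blocks; each block is served by its own first loading unit."""
    # The text assumes equal block heights, so require an exact fit.
    assert num_rows % block_height == 0, "rows must be an integer multiple"
    return [list(range(start, start + block_height))
            for start in range(0, num_rows, block_height)]

# Example matching the figure: 6 rows, 3-row kernels -> 2 array blocks.
blocks = divide_into_array_blocks(6, 3)
assert blocks == [[0, 1, 2], [3, 4, 5]]
assert len(blocks) == 2  # i.e., two first loading units are used
```

Each entry of `blocks` lists the rows one first loading unit feeds, which is exactly the "one loading unit per array block" arrangement in the text.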
  • Each first loading unit in the plurality of first loading units 602 may be connected to a first interface, and the first interface may be various types of interfaces, for example, an Application Programming Interface (Application Programming Interface, API), Remote Procedure Calls (RPC), or Remote Method Invocation (RMI).
  • Each first loading unit in the plurality of first loading units 602 can access a storage unit for storing input feature data in parallel through a first interface provided by itself.
  • the storage unit may be an SRAM or other type of storage device.
  • The storage unit may include a plurality of data output interfaces, and the first interface of each first loading unit communicates with one data output interface of the storage unit to obtain input feature data from the corresponding data output interface of the storage unit.
  • the data output interface may be the same type of interface as the first interface.
  • Alternatively, the storage unit may include only one data output interface, and the first interface of each first loading unit communicates with that same data output interface of the storage unit to obtain input feature data from it.
  • The time at which the plurality of first loading units 602 load the input feature data into the processing units in the u-th row of the systolic processing array is one clock cycle earlier than the time at which they load the input feature data into the processing units in the (u+1)-th row, where u is a positive integer.
  • That is, the time at which one of the plurality of first loading units 602 loads the input feature data into the processing units in the u-th row of the systolic processing array is one clock cycle earlier than the time at which another first loading unit loads the input feature data into the processing units in the (u+1)-th row, where u is a positive integer.
  • The plurality of first loading units 602 can communicate with each other so that the output timing of the input feature data is maintained on a slope; that is, the input feature data received by the current row of the systolic processing array is delayed by one clock cycle relative to the input feature data received by the previous row. When the product of the input feature data received by the current row and the convolution kernel parameter has been calculated, it is accumulated exactly with the product sent by the previous row of the systolic processing array.
  • each of the plurality of first loading units 602 may be provided with a second interface through which adjacent first loading units communicate.
  • The previous first loading unit (for example, the k-th first loading unit) may, upon completing its loading of data into the systolic processing array, send a synchronization signal through the second interface to the next first loading unit (the (k+1)-th first loading unit), and the next first loading unit starts to load data into the systolic processing array after receiving the synchronization signal.
  • the type of the second interface may be the same as or different from that of the first interface.
  • The sum of the buffering rates of the input feature data of the first loading units is greater than or equal to the loading rate of the systolic processing array.
  • Specifically, the sum of the cache rates of the input feature data in each clock cycle may be greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average cache rates of the input feature data may be greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data may be greater than or equal to the maximum loading rate of the systolic processing array.
  • The average buffering rate is the ratio of the total number of input feature data buffered in all clock cycles up to and including the current clock cycle to the number of those clock cycles.
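The rate condition above can be checked with a short model. This is an illustrative sketch, not part of the disclosure; the helper names and the example per-cycle byte counts are assumptions:

```python
def average_rate(bytes_per_cycle):
    """Average buffering rate up to and including the current cycle:
    total bytes buffered so far divided by the number of cycles."""
    return sum(bytes_per_cycle) / len(bytes_per_cycle)

def can_sustain(per_unit_rates, load_rate):
    """True if the summed average cache rate of all first loading
    units is at least the systolic array's loading rate."""
    return sum(average_rate(rates) for rates in per_unit_rates) >= load_rate

# Assumed example: two loading units alternating 32- and 8-byte useful
# fetches per cycle, with the array consuming 32 bytes per cycle.
unit = [32, 8, 32, 8]           # average rate: 20 bytes/cycle
assert can_sustain([unit, unit], 32)   # 2 * 20 = 40 >= 32: sustained
assert not can_sustain([unit], 32)     # 20 < 32: one unit starves the array
```

The second assertion shows why a single loading unit is insufficient in the inefficient-access cases, while two units in parallel satisfy the condition.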
  • The input feature data is not loaded into the systolic processing array until it has been buffered in a first loading unit.
  • the first loading unit can be used as a data pool to summarize the input feature data, and then stably output the data to the systolic processing array.
  • Each data output interface of the storage unit can output part or all of the input feature data of one row to the first interface of a first loading unit within one clock cycle. For example, data output interface 1 of the storage unit outputs the input feature data of the 1st row to first interface 1 of one first loading unit in the first clock cycle, data output interface 2 of the storage unit outputs the input feature data of the 2nd row to first interface 2 of another first loading unit in the first clock cycle, and so on.
  • Each of the first loading units may include at least one buffer sub-unit for buffering the input feature data acquired from the storage unit.
  • When a first loading unit includes multiple cache subunits, each cache subunit can sequentially acquire input feature data from the storage unit in a certain order and cache it, and the cached input feature data is sequentially loaded into at least one row of processing units of the systolic processing array.
  • Each cache subunit of the plurality of cache subunits can start to cache the input feature data corresponding to it once the input feature data corresponding to the previous cache subunit has been cached.
  • the loading of the previous cache subunit in the loading sequence is performed in parallel with the cache process of the latter cache subunit.
  • For example, when the number of cache subunits is 2, data can first be cached through cache subunit 1; when cache subunit 1 has finished caching, the data in cache subunit 1 is loaded into the systolic processing array. While the data in cache subunit 1 is being loaded, data can be cached by cache subunit 2.
  • When the caching of cache subunit 2 is completed, the data in cache subunit 2 is loaded into the systolic processing array; while the data in cache subunit 2 is being loaded, data can again be cached by cache subunit 1, and so forth.
  • the number of cache subunits is greater than 2, the caching and loading methods are similar to the above-mentioned cases, and are not repeated here.
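The alternation between two cache subunits (a ping-pong, or double-buffering, scheme) can be sketched as follows; the function name and the (fill, drain) tuple convention are illustrative assumptions, not part of the disclosure:

```python
def double_buffer_schedule(num_chunks):
    """For data chunk n, cache subunit n % 2 is being filled while the
    other subunit (holding chunk n-1, if any) is drained into the
    systolic processing array. None means nothing to drain yet."""
    return [(n % 2, (n + 1) % 2 if n > 0 else None)
            for n in range(num_chunks)]

# Four data chunks: subunit 0 fills first (nothing drains), then the
# two subunits swap fill/drain roles every chunk.
assert double_buffer_schedule(4) == [(0, None), (1, 0), (0, 1), (1, 0)]
```

With more than two subunits the same rotation applies modulo the subunit count, which is why the text treats the general case as analogous.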
  • The cache subunit may include a plurality of cache blocks, each of which is used to cache the input feature data required by one row of processing units in the systolic processing array. A loading subunit loads the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array. After the input feature data cached in the v-th cache block has been loaded, a first load instruction can be sent for the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, where v is an integer greater than 1.
  • the first loading unit may load the input feature data buffered in the buffer subunit into the systolic processing array.
  • For example, each first loading unit reads one input feature data from the corresponding cache subunit within one clock cycle and loads it into the systolic processing array. Since multiple first loading units acquire the input feature data from the storage unit in parallel, the amount of data acquired per clock cycle is multiplied compared with acquiring data through only one loading unit.
  • The quantity of valid data acquired by the systolic processing array increases correspondingly, thereby reducing the idle time of the processing units in the systolic processing array and improving the processing efficiency of the systolic processing array.
  • the following two situations are respectively analyzed to illustrate the technical effects of the embodiments of the present disclosure:
  • Initially, no valid data is cached in either first loading unit. In the first clock cycle, each first loading unit can newly cache 32 bytes of valid data; therefore, in step S702, the two first loading units cache a total of 64 bytes of valid data in the first clock cycle. In the second clock cycle, each first loading unit can newly cache 8 bytes of valid data, so the two first loading units cache an additional 16 bytes of valid data in the second clock cycle.
  • Together with the data remaining in the cache, the two first loading units hold a total of 48 bytes of valid data in the second clock cycle.
  • In step S702 corresponding to the first clock cycle, 32 bytes of input feature data are taken out of the cache and loaded into the systolic processing array, and in step S704 the two first loading units cache the remaining 32 bytes of input feature data. In the second clock cycle, the two first loading units take out 32 bytes of input feature data from the cache and load them into the systolic processing array, leaving 16 bytes of input feature data in the cache. Therefore, according to the solutions of the embodiments of the present disclosure, the systolic processing array can continuously obtain enough input feature data.
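The Case-1 bookkeeping above can be reproduced with a small occupancy model. The loop and names are illustrative assumptions; the byte counts follow the example in the text (two units alternately buffering 32 and 8 valid bytes each, with the array consuming 32 bytes per cycle):

```python
def buffer_occupancy(per_cycle_fill, consume, cycles):
    """Bytes remaining in the shared cache after each clock cycle:
    add the new valid bytes, subtract what the array loads."""
    occupancy, trace = 0, []
    for c in range(cycles):
        occupancy += per_cycle_fill[c % len(per_cycle_fill)]  # new valid data
        occupancy -= consume                                  # array loads
        trace.append(occupancy)
    return trace

# Two units together: 2*32 = 64 B in odd cycles, 2*8 = 16 B in even cycles.
trace = buffer_occupancy([64, 16], consume=32, cycles=4)
assert trace == [32, 16, 48, 32]   # matches the 32/16/48 figures above
assert min(trace) >= 0             # the cache never underflows: array stays fed
```

With only one loading unit the fill pattern would be [32, 8], and the occupancy would go negative in the second cycle, i.e., the array would stall.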
  • Case 2: the number of columns of the input feature data is relatively small. As shown in FIG. 7B, assume that the number of columns of the input feature data is 32 bytes, each input feature data is 1 byte, the minimum access unit is 64 bytes, and the number of first loading units is 2; then the amount of valid data cached by one first loading unit in each clock cycle is 64 bytes. Initially, no valid data is cached in either loading unit. In the first clock cycle, each first loading unit can newly cache 32 bytes of valid data; therefore, in step S712, the two first loading units cache a total of 64 bytes of valid data in the first clock cycle. In step S716, the two loading units cache a total of 64 bytes of valid data in the second clock cycle.
  • In step S712 corresponding to the first clock cycle, 64 bytes of input feature data are taken out of the cache and loaded into the systolic processing array.
  • In step S714, the two first loading units have no buffered data; in step S716 corresponding to the second clock cycle, the two first loading units take out 64 bytes of input feature data from the cache and load them into the systolic processing array.
  • In step S718, the two first loading units again have no buffered data. It can be seen that the buffering rate of the two first loading units is equal to the data consumption rate of the systolic processing array, so the systolic processing array can always obtain enough input feature data.
  • the amount of input feature data buffered in each clock cycle is doubled, so that the cache can always provide enough data for loading, improving the processing efficiency of the systolic processing array.
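The supply/demand balance described above can be illustrated with a toy per-cycle model. This sketch is not part of the disclosure; the function name, its parameters, and the "buffer then consume" accounting are assumptions made purely for illustration.

```python
def simulate(valid_per_read, n_loaders, consume_per_cycle, cycles):
    """Toy model: each cycle, every first loading unit buffers one read's
    worth of valid data, then the systolic processing array consumes up to
    consume_per_cycle bytes from the combined cache."""
    buffered = 0
    history = []
    for _ in range(cycles):
        buffered += n_loaders * valid_per_read       # caching step
        consumed = min(buffered, consume_per_cycle)  # loading step
        buffered -= consumed
        history.append((consumed, buffered))
    return history

# FIG. 7B-like case: two units each buffer 32 valid bytes per cycle while
# the array consumes 64 bytes per cycle -- supply always meets demand.
print(simulate(32, 2, 64, 3))   # → [(64, 0), (64, 0), (64, 0)]

# FIG. 7A-like case: only 8 valid bytes per read per unit -- the array
# receives just 16 of the 64 bytes it could consume each cycle.
print(simulate(8, 2, 64, 3))    # → [(16, 0), (16, 0), (16, 0)]
```

The steady-state tuples show the consumed bytes and the leftover cache per cycle; when the first element stays below the array's consumption rate, the array is starved.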
  • when the height of the systolic processing array is large, data can be efficiently input to the systolic processing array in this way, so that the amount of data loaded in each clock cycle is as close as possible to the height of the systolic processing array. As a result,
  • the expansion of the height dimension is not limited, which is conducive to the flexible design of high-performance systolic processing arrays of different sizes.
  • the input feature map can be divided into blocks to obtain multiple data blocks.
  • the number of columns of each data block is less than the total number of columns of the input feature data, and the number of rows of each data block is less than or equal to the total number of rows of the input feature data.
  • only one data block is cached and loaded at a time, and the next data block can be cached after the cached data block is loaded.
  • the number of columns of the data block may be equal to the number of data corresponding to the minimum access unit of the storage unit.
  • the number of columns of the data block is 32 columns.
  • the number of columns of the data block may also be equal to an integer multiple of the number of data corresponding to the minimum access unit of the storage unit.
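The column-wise blocking just described can be sketched with a small helper. The function name and the half-open column ranges are assumptions for illustration only.

```python
def plan_blocks(total_cols, min_access_cols, multiple=1):
    """Split an input feature map into column blocks whose width is an
    integer multiple of the minimum access unit; the last block may be
    narrower when the total column count is not an exact multiple."""
    block_w = min_access_cols * multiple
    blocks = []
    start = 0
    while start < total_cols:
        blocks.append((start, min(start + block_w, total_cols)))
        start += block_w
    return blocks

# 200-column feature map, 64-data minimum access unit, one unit per block:
print(plan_blocks(200, 64))  # → [(0, 64), (64, 128), (128, 192), (192, 200)]
```

Each tuple is a (start, end) column range; only one such block would be cached and loaded at a time, as described above.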
  • each first loading unit in the plurality of first loading units may further perform first filling on the input feature data during the process of loading the input feature data. Since each first loading unit performs filling in the same way, the following takes the filling method of one first loading unit (referred to as loading unit A) as an example.
  • the loading unit A may first obtain the description information of the input feature data. The description information may include information indicating whether the input feature data needs to be filled; for example, "0" or a null value indicates that no filling is required, and "1" indicates that filling is required.
  • the description information may also include padding parameters, for example, the value of the padding data and the number of rows and/or columns of padding data.
  • the loading unit A can determine, according to the description information, whether the input feature data needs to be filled. If so, during data loading it can first determine whether the data currently to be loaded is input feature data or filling data: if it is input feature data, the corresponding data is read directly from the cache and loaded; if it is filling data, the filling data is generated according to the padding parameters and loaded.
  • the above-mentioned first filling may include filling at least one of the left boundary and the right boundary of the input feature data.
  • Filling the left boundary of the input feature data refers to adding at least one column of filling data before the first column of the input feature data; filling the right boundary of the input feature data refers to adding at least one column of filling data after the last column of the input feature data.
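A minimal sketch of the first filling (left/right boundaries only); the function name and the zero-fill default are assumptions for illustration.

```python
def first_fill(rows, pad_l, pad_r, value=0):
    """Add pad_l columns of filling data before the first column and
    pad_r columns after the last column of the input feature data."""
    return [[value] * pad_l + list(row) + [value] * pad_r for row in rows]

ifm = [[1, 2, 3],
       [4, 5, 6]]
print(first_fill(ifm, pad_l=1, pad_r=1))
# → [[0, 1, 2, 3, 0], [0, 4, 5, 6, 0]]
```

Note that in the apparatus the filling data is generated on the fly during loading rather than being stored in the cache.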
  • the first loading unit includes: a sending subunit, configured to send a read instruction to the storage unit; a cache subunit, configured to buffer the input
  • feature data returned by the storage unit according to the read instruction; and a loading subunit, configured to load the buffered input feature data into at least one row of processing units in the systolic processing array.
  • the sending subunit may include a first parsing subunit for receiving a loading instruction and parsing the loading instruction to generate description information of the input feature data to be loaded; and a second parsing subunit for parsing
  • the description information and sending the read instruction to the storage unit according to the parsing result.
  • the first parsing subunit may receive a loading instruction from the controller.
  • the present disclosure is not limited to this. According to other embodiments of the present disclosure, the first parsing subunit may also receive loading instructions from other apparatuses or storage devices.
  • the description information required by a first loading unit may specifically include some or all of the following:
  • the number of groups of input feature data to be loaded (ifm_num), which is used to indicate the total number of groups of input feature data that the first loading unit is responsible for loading.
  • an input feature map is a set of input feature data.
  • the number (ifm_start) of the first group of input feature data to be loaded, which is used to indicate the identification information of the first group of input feature data that the first loading unit is responsible for loading.
  • the numbers of the remaining groups of input feature data that the first loading unit is responsible for loading increase sequentially. For example, if ifm_start is equal to 1, the number of the second group of input feature data is 2, the number of the third group is 3, and so on.
  • the number of groups of input feature data (ifm_num_perb) that can be processed simultaneously by one array block in the systolic processing array (an array block generally includes multiple rows of processing units) and that the first loading unit is responsible for loading. During convolution processing, it is sometimes necessary to accumulate the convolution results of multiple groups of input feature data with their corresponding convolution kernel parameters, so the systolic processing array needs to be able to process multiple groups of input feature data at the same time.
  • the input feature map includes R (Red) channel map, G (Green) channel map, and B (Blue) channel map.
  • Three sets of convolution kernel parameters are used to convolve the three channel maps respectively, and the convolution
  • results of the three channel maps are then accumulated. Therefore, one array block in the systolic processing array can be used to load the three sets of convolution kernel parameters, and the first loading unit can load the above three channel maps for that array block. In this case, ifm_num_perb is equal to 3.
  • the height of the convolution kernel (kernel_h), indicating the height of the convolution kernel parameter in the array block that the first loading unit is responsible for loading.
  • the convolution kernels in the array block that the same first loading unit is responsible for loading may include all rows of a complete set of convolution kernel parameters, or only some rows of a complete convolution kernel.
  • the size of a set of convolution kernel parameters is 3×3,
  • the height of the array block that a first loading unit is responsible for loading may be 5, and these 5 rows may include the three rows of data in the first set of convolution kernel parameters and two rows of data in the second set of convolution kernel parameters.
  • the depth of the buffer subunits included in the first loading unit may be equal to the depth of the array block that the first loading unit is responsible for loading.
  • the starting position (kernel_h_start) of the array block that the first loading unit is responsible for loading, which is used to indicate the row number, in the systolic processing array, of the first row of processing units that the first loading unit is responsible for loading.
  • the base address (ifm_baddr) of the input feature data to be loaded, which is used to indicate the first address, in the storage unit, of the input feature data that the first loading unit is responsible for loading.
  • for example, the address of the second group of input feature data is the base address plus the bit length corresponding to ifm_len (the length of one group of input feature data).
  • the width of the input feature data to be loaded (ifm_w), indicating the total number of columns of a set of input feature data.
  • if this width is greater than the minimum access unit of the storage unit, the same row of input feature data can be loaded in multiple accesses.
  • the description information may include at least one of the above information, and may also include other information other than the above information, which will not be repeated here.
  • the description information is used to let the first loading unit know how to obtain the input feature data from the storage unit.
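The description fields listed above can be collected into a simple record for illustration. The grouping into one structure, the helper method, and the example values are assumptions made for this sketch; they are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class IfmDescriptor:
    """Description information for one first loading unit
    (field names follow the text)."""
    ifm_num: int         # total number of groups of input feature data to load
    ifm_start: int       # number (id) of the first group to load
    ifm_num_perb: int    # groups processed simultaneously by one array block
    kernel_h: int        # height of the convolution kernel in the array block
    kernel_h_start: int  # row number of the first row the unit loads
    ifm_baddr: int       # base address of the input feature data in storage
    ifm_w: int           # total number of columns of one group

    def group_addr(self, k, ifm_len):
        """Address of the k-th group (0-based), given the per-group length,
        mirroring 'base address plus ifm_len' from the text."""
        return self.ifm_baddr + k * ifm_len

desc = IfmDescriptor(ifm_num=3, ifm_start=1, ifm_num_perb=3,
                     kernel_h=3, kernel_h_start=0, ifm_baddr=0x1000, ifm_w=224)
print(hex(desc.group_addr(1, ifm_len=0x100)))  # → 0x1100
```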
  • the second parsing subunit is further configured to: generate auxiliary loading information, where the auxiliary loading information is used to determine a loading mode of the input feature data.
  • the auxiliary loading information may include some or all of the following information:
  • the sliding step size (stride_h), in the column direction, of the convolution kernel over the input feature data that the first loading unit is responsible for loading, which is used to indicate how many rows the convolution kernel slides down on the input feature data each time convolution processing is performed: when stride_h is equal to 1, it slides down by 1 row; when stride_h is equal to 2, it slides down by 2 rows; and so on.
  • This parameter allows the first loading unit to know which rows of input feature data need to be loaded into the systolic processing array.
  • the height of the dilated convolution kernel (dilate_h), which is used to indicate the interval of the number of lines of the input feature data convolved with the convolution kernel parameters.
  • for example, the convolution kernel is 3×3;
  • when dilate_h is 1, the interval is 1, that is, adjacent rows (e.g., row 1, row 2, and row 3) of the input feature data are convolved with the 3 rows of data in the convolution kernel parameters, respectively;
  • when dilate_h is 2, the interval is 2, that is, every other row (e.g., row 1, row 3, and row 5) of the input feature data is convolved with the 3 rows of data in the convolution kernel parameters, respectively.
  • the height of the input feature data (ifm_h), used to represent the total number of rows of a set of input feature data.
  • the height of the output feature data (ofm_h), used to represent the total number of rows of a set of output feature data.
  • the number of padding data rows (pad_t) for padding the upper boundary of the input feature data.
  • the number of padding data rows (pad_b) for padding the lower boundary of the input feature data.
  • the number of padding data columns (pad_l) for padding the left boundary of the input feature data.
  • the number of padding data columns (pad_r) for padding the right boundary of the input feature data.
  • the auxiliary loading information may include at least one of the above information, and may also include other information other than the above information, which will not be repeated here.
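The roles of the stride_h and dilate_h parameters above can be made concrete with a small helper that lists which input rows feed each kernel row. This is a sketch; the 0-based indexing and the function name are assumptions.

```python
def kernel_input_rows(out_row, stride_h, dilate_h, kernel_h):
    """Rows of the input feature data convolved with each of the kernel_h
    kernel rows when producing output row out_row (0-based indices)."""
    base = out_row * stride_h
    return [base + k * dilate_h for k in range(kernel_h)]

# 3-row kernel, unit stride: adjacent input rows feed the kernel rows.
print(kernel_input_rows(0, stride_h=1, dilate_h=1, kernel_h=3))  # → [0, 1, 2]
# Dilation 2: every other input row feeds the kernel rows.
print(kernel_input_rows(0, stride_h=1, dilate_h=2, kernel_h=3))  # → [0, 2, 4]
# Stride 2: the window for output row 1 starts two rows further down.
print(kernel_input_rows(1, stride_h=2, dilate_h=1, kernel_h=3))  # → [2, 3, 4]
```

Such a mapping is exactly what lets a first loading unit know which rows of input feature data need to be loaded into the systolic processing array.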
  • the cached input feature data may also be rearranged according to the auxiliary loading information, so that the cache mode of the input feature data matches the loading mode, thereby improving data loading efficiency. For example, input feature data loaded to different line processing units can be cached in different cache addresses of cache subunits through rearrangement, or data required for loading can be filtered from the cached input feature data according to loading needs.
  • the cache subunit includes: a first cache subunit for caching the auxiliary loading information; and a second cache subunit for reading from the first cache subunit the auxiliary loading information, and cache the input feature data returned according to the read instruction according to the auxiliary loading information.
  • the first cache subunit may be a First In First Out (First In First Out, FIFO) queue.
  • the second cache subunit includes: a third cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction;
  • a read-write subunit, configured to read the auxiliary loading information from the first cache subunit, rearrange the input feature data buffered by the third cache subunit according to the auxiliary loading information, and write the rearranged input feature data into a fourth cache subunit; and a fourth cache subunit, configured to buffer the rearranged input feature data so that the loading subunit can load the rearranged input feature data into the systolic processing array.
  • the third buffer subunit may be a FIFO queue
  • the fourth buffer subunit may be a random access memory (Random Access Memory, RAM).
  • the read-write subunit is further configured to: in the process of rearranging the input feature data buffered by the third cache subunit, perform a second filling on the input feature data.
  • the second padding includes padding at least one of an upper boundary and a lower boundary of the input feature data.
  • FIG. 8 shows a schematic structural diagram of the first loading unit according to an embodiment of the present disclosure.
  • the first parsing subunit 802 (denoted as IFM_SBLK_INFO in the figure) is used to receive an instruction, parse the instruction according to the requirements of the systolic processing array for input feature data, and output the description information to the second parsing subunit 804 (denoted as IFM_SBLK_RD in the figure).
  • the second parsing subunit 804 is used for parsing the description information and sending an instruction to read the input feature data to the storage unit; the read-back input feature data is written into the third cache subunit 806 (denoted as IFM_FIFO in the figure). During the parsing process, auxiliary loading information is generated and written to the first cache subunit 814 (denoted as INFO_FIFO in the figure).
  • the read-write subunit (RD_FIFO_WR_RAM) 808 is used to read data from the third cache subunit 806 according to the auxiliary loading information generated by the second parsing subunit 804 and load it into the fourth cache subunit 812 (denoted as IFM_RAM in the figure). This process also completes the upper and lower boundary padding.
  • the loading subunit 810 (denoted as WR_IFM_2MAC in the figure) is used to read data from the fourth cache subunit 812 according to the requirements of the systolic processing array for input feature data and send it to the systolic processing array in turn. This process also completes the left and right boundary padding.
  • the third cache subunit 806 is used for storing the input feature data read out from the storage unit. That is, the third buffering subunit 806 buffers the input feature data sent by the storage unit, and loads the input feature data buffered in the third buffering subunit into the fourth buffering subunit 812 .
  • the width of the third cache sub-unit 806 is consistent with the bit width of the data port of the storage unit, that is, the smallest access unit of the storage unit.
  • the third cache subunit 806 is used for storing the read-out input feature data until it is loaded into the fourth cache subunit 812.
  • the number of the fourth cache sub-units is multiple, and the multiple fourth cache sub-units sequentially cache the rearranged input feature data; the loading sub-unit is used to sequentially The input feature data in each of the plurality of fourth buffer sub-units is loaded into the systolic processing array.
  • a separate fourth cache subunit may be used for each group, while the remaining functional units may be shared.
  • Each fourth cache subunit of the plurality of fourth cache subunits starts to cache the rearranged input feature data when the previous fourth cache subunit has finished caching.
  • each fourth cache subunit in the plurality of fourth cache subunits may further include a plurality of fifth cache subunits.
  • the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and
  • the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  • the number of the fourth cache subunits (that is, the IFM_RAM in the figure) is 3, which are respectively called the cache subunits 902 (denoted as ping_ram in the figure) and the cache subunits 904 (denoted as pong_ram in the figure) and cache subunit 906 (denoted as peng_ram in the figure).
  • the fourth cache subunit stores the input characteristic data according to the requirement of the systolic processing array for the input characteristic data.
  • the three identical groups of IFM_RAM (ping, pong, and peng) can work in a pipeline to speed up the loading of the systolic processing array.
  • the depth of each group of fourth cache subunits is equal to the height h of one array block in the systolic processing array, and each group of fourth cache subunits includes several small RAMs (eg, ping_ram0) of depth h, whose width is the same as the bit
  • width of the input feature data.
  • the present disclosure uses a bit width of 8 bits for illustration, but is applicable to other bit widths as well. For different application scenarios, it is not necessary to instantiate three groups of IFM_RAM. For example, when performance requirements are low, only one group of IFM_RAM may be instantiated; when performance requirements are high, more groups of IFM_RAM may be instantiated, until adding IFM_RAM no longer improves performance.
  • Each first loading unit can be synchronized through a synchronization signal out_sync, which indicates whether the corresponding cache subunit 902, cache subunit 904, or cache subunit 906 in each first loading unit has been loaded with data and is ready to be output to the systolic processing array. Only when the cache subunit 902, cache subunit 904, or cache subunit 906 in every first loading unit has been loaded with data will the selection unit 910 select the corresponding data from among the cache subunits 902, 904, and 906 and output this round of data to the systolic processing array.
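The ping/pong/peng rotation can be sketched as a triple buffer. This models only the rotation order, not the out_sync handshake or RAM timing; apart from the ping_ram/pong_ram/peng_ram names taken from the figure, all names here are assumptions.

```python
class TripleBuffer:
    """Sketch of the ping/pong/peng IFM_RAM rotation: writes fill one
    buffer while reads drain another, so caching and loading overlap."""
    def __init__(self, names=("ping_ram", "pong_ram", "peng_ram")):
        self.bufs = {n: [] for n in names}
        self.order = list(names)
        self.wr = 0   # index of the buffer currently being filled
        self.rd = 0   # index of the buffer currently being drained

    def write_block(self, data):
        name = self.order[self.wr]
        self.bufs[name] = list(data)
        self.wr = (self.wr + 1) % len(self.order)
        return name

    def read_block(self):
        name = self.order[self.rd]
        data, self.bufs[name] = self.bufs[name], []
        self.rd = (self.rd + 1) % len(self.order)
        return name, data

tb = TripleBuffer()
tb.write_block([1, 2])   # fills ping_ram
tb.write_block([3, 4])   # fills pong_ram while ping_ram is drained
print(tb.read_block())   # → ('ping_ram', [1, 2])
```

With three buffers, one can be written while another is read and a third stands ready, which is what allows the pipeline to keep the systolic processing array fed.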
  • each of the plurality of first loading units is capable of sequentially reading a plurality of data blocks of the input feature data from the storage unit, and when at least one data block among the plurality of data blocks has been cached, the at least one data block can be loaded into an array block in the systolic processing array.
  • FIG. 10 it is a processing flow chart of the data processing apparatus according to the embodiment of the present disclosure. It should be noted that this processing flow is only a schematic diagram of a possible processing manner of the data processing apparatus of the present disclosure, and is not used to limit the present disclosure.
  • Step 1002 Receive an instruction, where the instruction can carry at least one of the above-mentioned description information.
  • Step 1004 Parse the instruction to obtain the description information.
  • Step 1006 Read row data according to the minimum access unit of the storage unit. That is, starting from the row corresponding to the convolution kernel height starting position kernel_h_start in the first input feature map ifm_start, access one minimum access unit per row until the kernel_h rows of data corresponding to one convolution kernel have been accessed, then continue to access the kernel_h rows of data of the next group of input feature data corresponding to the next convolution kernel, accessing one minimum access unit per row.
  • kernel_h is the height of the convolution kernel.
  • Step 1008 Following step 1006, complete one row access (one minimum access unit per access) for each of the ifm_num_perb groups of input feature data mapped to the array block at one time.
  • ifm_num_perb is the number of input feature maps corresponding to one mapping of the array block.
  • Step 1010 Execute steps 1006 and 1008 in a loop, accessing one minimum access unit per row, until the accesses cover an entire row; at that point, one sliding step stride_h of the convolution kernel over the kernel_h rows of an input feature map is completed.
  • the convolution kernel parameters need to move a total of ofm_h (the output feature map height) times over the input feature data.
  • Step 1012 Execute steps 1006 to 1010 in a loop to complete the ofm_h moving scans of the convolution kernel parameters over the input feature data. At this point, the loading of all input feature data corresponding to ifm_num_perb is completed.
  • Step 1014 Execute steps 1006 to 1012 cyclically, ifm_num_perb input feature maps at a time, to complete the loading of all input feature maps.
  • Step 1016 End.
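The loop nesting of steps 1006 to 1014 can be paraphrased in Python. The exact loop bounds, the nesting order, and the tuple format are interpretations of the flow above, made only to show its structure; real hardware would issue read instructions rather than build a list.

```python
def load_order(ifm_num, ifm_num_perb, kernel_h, ofm_h, accesses_per_row):
    """Enumerate read accesses in the nested-loop order of steps 1006-1014
    (one tuple per minimum-access-unit read; illustrative only)."""
    accesses = []
    for ifm in range(ifm_num):                  # step 1014: all feature maps
        for move in range(ofm_h):               # step 1012: kernel moves ofm_h times
            for seg in range(accesses_per_row): # step 1010: cover a full row
                for grp in range(ifm_num_perb): # step 1008: groups per array block
                    for row in range(kernel_h): # step 1006: kernel_h rows
                        accesses.append((ifm, move, seg, grp, row))
    return accesses

order = load_order(ifm_num=1, ifm_num_perb=2, kernel_h=3, ofm_h=2,
                   accesses_per_row=1)
print(len(order))   # → 12 (1 * 2 * 1 * 2 * 3 accesses)
```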
  • the number of columns of the systolic processing array is an integer multiple of three. Since the minimum access unit of storage units such as SRAM is 2 to the nth power of bytes, the dimensions of most systolic processing arrays in the industry are also powers of 2.
  • among convolution kernels of various sizes, the 3×3 convolution kernel accounts for a high proportion, and when a 3×3 convolution kernel is mapped to a systolic processing array whose dimension is a power of 2, some processing units cannot be mapped to, resulting in a waste of systolic processing array resources.
  • designing the number of columns of the systolic processing array to be an integer multiple of 3 can effectively reduce resource waste in most cases, thereby improving the processing efficiency of the systolic processing array.
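The utilization argument can be checked numerically; this is an illustrative calculation, not part of the disclosure, and it considers only the column dimension.

```python
def column_utilization(array_cols, kernel_w=3):
    """Fraction of array columns usable when kernel_w-wide convolution
    kernels are mapped side by side onto the systolic processing array."""
    usable = (array_cols // kernel_w) * kernel_w
    return usable / array_cols

print(column_utilization(32))  # → 0.9375 (2 of 32 columns left idle)
print(column_utilization(33))  # → 1.0    (33 is an integer multiple of 3)
```

A power-of-2 width such as 32 always strands two columns under 3-wide kernels, while any multiple of 3 reaches full column utilization.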
  • the size of the systolic processing array is 32×32 (that is, each row and each column of the systolic processing array has 32 processing units), which is also the size of the systolic processing array of common 1K-scale CNN accelerators in the industry; however, for a 3×3 convolution kernel, two rows and two columns of such a systolic processing array can never be used, resulting in a waste of valuable computing resources. As shown in FIG.
  • each small square represents a processing unit 1106, a plurality of first loading units 1102 respectively load input feature maps IFM0 to input feature maps IFM9, and a plurality of second loading units 1104 respectively load weights Weight 0 to weight Weight 9.
  • the plurality of first loading units 1102 transmit their respective corresponding input feature maps to the corresponding plurality of processing units 1106 at different times.
  • the corresponding plurality of processing units are located in one array block. That is, the plurality of first loading units 1102 transmit respective corresponding input feature maps to corresponding array blocks at different times.
  • the first loading unit 1102 loaded with the input feature map IFM0 transmits the input feature map IFM0 to the n rows of processing units 1106 adjacent to that first loading unit 1102, where n is the number of rows.
  • n corresponds to the size of the convolution kernel.
  • n is equal to 3 when the size of the convolution kernel is 3×3. That is, n is equal to the width or length of the convolution kernel.
  • the times at which two adjacent first loading units 1102 transmit their corresponding feature maps to the corresponding processing units 1106 are separated by one clock cycle.
  • the plurality of second loading units 1104 transmit their corresponding weights to the processing unit 1106 at different times.
  • the second loading unit 1104 loaded with the weight Weight 0 transmits the weight Weight 0 to the n rows of processing units 1106 adjacent to that second loading unit 1104, where n is the number of rows.
  • n corresponds to the size of the convolution kernel.
  • n is equal to 3 when the size of the convolution kernel is 3×3; that is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two adjacent second loading units 1104 transmit their corresponding weights to the corresponding processing units 1106 are separated by one clock cycle.
  • every three rows of processing units 1106 in the above-mentioned systolic processing array can load up to 10 groups of convolution kernel parameters.
  • FIG. 11B is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure. It should be noted that the same elements in FIG. 11B as those in FIG. 11A have the same or similar functions as those in FIG. 11A . As shown in FIG. 11B ,
  • each small square in the figure still represents a processing unit 1106. Each group of 3×3 processing units 1106 in the above-mentioned systolic processing array can be used to load one 3×3 convolution kernel, one convolution kernel parameter kij can be loaded into each processing unit 1106, and no processing unit 1106 is idle. Therefore, ideally, the utilization of the systolic processing array is 100%.
  • the above-mentioned data processing apparatus may be a processing chip, for example, an SoC chip, or may be a computer device.
  • FIG. 12 shows a schematic diagram of the hardware structure of a more specific data processing apparatus provided by an embodiment of this specification.
  • the apparatus may include: a processor 1202 , a memory 1204 , an input/output interface 1206 , a communication interface 1208 and a bus 1210 .
  • the processor 1202 , the memory 1204 , the input/output interface 1206 and the communication interface 1208 realize the communication connection among each other within the device through the bus 1210 .
  • the processor 1202 can be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes relevant programs to implement the technical solutions provided by the embodiments of this specification.
  • the memory 1204 can be implemented in the form of ROM (Read Only Memory, read-only memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, and the like.
  • the memory 1204 can store the operating system and other application programs. When implementing the technical solutions provided by the embodiments of this specification through software or firmware, the relevant program codes are stored in the memory 1204 and invoked by the processor 1202 for execution.
  • the input/output interface 1206 is used for connecting input/output modules to realize information input and output.
  • the input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions.
  • the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc.
  • the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
  • the communication interface 1208 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices.
  • the communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
  • the bus 1210 includes a path that transfers information between the various components of the device (eg, the processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208).
  • the above-mentioned device only shows the processor 1202, the memory 1204, the input/output interface 1206, the communication interface 1208 and the bus 1210; in the specific implementation process, the device may also include other components necessary for normal operation.
  • the above-mentioned device may only include components necessary to implement the solutions of the embodiments of the present specification, and does not necessarily include all the components shown in the figures.
  • an embodiment of the present disclosure further provides a data processing system 1400, which may include the data processing apparatus (eg, data processing apparatus 1406) of any of the above-mentioned embodiments, and a systolic processing array 1408 for loading the
  • input feature data and convolution kernel parameters provided by the data processing apparatus 1406 and performing convolution processing on the input feature data and the convolution kernel parameters.
  • system further includes: a second loading unit 1404 for loading the convolution kernel parameters into the systolic processing array.
  • the system further includes: a storage unit for storing the input feature data.
  • the storage unit 1302 includes a plurality of mutually independent storage subunits, each storage subunit being used to store part of the input feature data; the plurality of
  • first loading units are configured to access different storage subunits at the same time to acquire the input feature data stored in the corresponding storage subunits.
  • the storage unit 1302 further includes a scheduling unit 1304, configured to receive the access requests of the multiple first loading units and send the access requests to the corresponding storage subunits 1306, so that the multiple first loading units access the corresponding storage subunits 1306.
  • the system further includes: an output unit, configured to acquire the processing result output by the systolic processing array, and store the processing result, or output the processing result.
  • the system may be implemented based on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
  • an embodiment of the present disclosure further provides a neural network accelerator 1500, characterized in that, the neural network accelerator includes the data processing apparatus 1502 described in any embodiment of the present disclosure, or includes the present disclosure The data processing system 1504 of any embodiment.
  • the neural network accelerator is a CNN accelerator or a Recurrent Neural Network (RNN) accelerator.
  • the present disclosure also provides a data processing method, applied to each first loading unit in a data processing apparatus that includes a plurality of first loading units, to load input feature data into a systolic processing array. The method includes:
  • Step 1602: Access a storage unit in parallel through each of the plurality of first loading units, and read input feature data from the storage unit;
  • Step 1604: Cache the read input feature data;
  • Step 1606: Load the cached input feature data into at least one row of processing units in the systolic processing array.
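As a rough software analogue of steps 1602-1606 (not part of the patent; the class, method, and variable names below are all hypothetical, and the parallelism across loading units is abstracted away), each first loading unit performs a read-cache-load sequence:

```python
from collections import deque

# Hypothetical sketch of steps 1602-1606 for one first loading unit.
# The storage unit is modeled as a plain list and the systolic array
# rows as a dictionary; none of these names come from the patent.
class FirstLoadingUnit:
    def __init__(self, row_ids):
        self.cache = deque()      # step 1604: local cache
        self.row_ids = row_ids    # rows of the array this unit feeds

    def read(self, storage, start, count):
        # step 1602: read a slice of the input feature data
        self.cache.extend(storage[start:start + count])

    def load(self, array_rows):
        # step 1606: drain the cache into the assigned array rows
        for r in self.row_ids:
            if self.cache:
                array_rows.setdefault(r, []).append(self.cache.popleft())

storage = list(range(8))
rows = {}
unit = FirstLoadingUnit(row_ids=[0, 1, 2])
unit.read(storage, 0, 3)
unit.load(rows)
assert rows == {0: [0], 1: [1], 2: [2]}
```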
  • the sum of the rates at which the plurality of first loading units buffer valid data in the input feature data is greater than or equal to the loading rate of the systolic processing array.
  • the sum of the buffering rates of the input feature data in each clock cycle is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average buffering rates of the input feature data is greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data is greater than or equal to the maximum loading rate of the systolic processing array.
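As a numerical illustration of the rate condition above (all numbers are assumed example values, not taken from the patent):

```python
# Illustrative check of the stall-free condition: the summed buffering
# rates of the first loading units must cover the systolic array's
# loading rate, otherwise processing units go idle.
def stall_free(buffer_rates, load_rate):
    return sum(buffer_rates) >= load_rate

# Four units each buffering 8 valid values per cycle can feed an array
# consuming 32 values per cycle; three such units cannot.
assert stall_free([8, 8, 8, 8], 32)
assert not stall_free([8, 8, 8], 32)
```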
  • the method further includes: in the process of loading the input feature data, performing a first filling on the input feature data.
  • the performing the first padding on the input feature data includes: padding at least one of a left border and a right border of the input feature data.
  • accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: reading, from the storage unit each time, one data block of the input feature data; and loading the buffered input feature data into at least one row of processing units in the systolic processing array includes: when the data block has been cached, loading the data block into at least one row of processing units in the systolic processing array.
  • the number of columns or rows of the data block is equal to the number of data corresponding to the minimum access unit of the storage unit.
  • each of the plurality of first loading units includes a sending subunit, a cache subunit, and a loading subunit; accessing the storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: sending a read instruction to the storage unit through the sending subunit; caching, through the cache subunit, the input feature data returned in response to the read instruction; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit.
  • the height of the cache subunit is equal to the height of the systolic processing array.
  • each cache subunit corresponds to one array block, and the height of the cache subunit is equal to the height of the corresponding array block.
  • the number of cache subunits is multiple; caching, through the cache subunit, the input feature data returned in response to the read instruction includes: sequentially caching, through the multiple cache subunits, the input feature data returned in response to the read instruction; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit includes: sequentially loading the input feature data in each of the multiple cache subunits into the at least one row of processing units of the systolic processing array.
  • sequentially caching, through the multiple cache subunits, the input feature data returned in response to the read instruction includes: when the caching of the input feature data corresponding to the previous cache subunit among the multiple cache subunits is completed, starting to cache the input feature data corresponding to the current cache subunit among the multiple cache subunits.
  • the cache subunit includes a plurality of cache blocks; caching the input feature data returned in response to the read instruction through the cache subunit includes: caching, through each cache block of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and loading the cached input feature data into at least one row of processing units in the systolic processing array through the loading subunit includes: loading, through the loading subunit, the input feature data cached by each cache block of the plurality of cache blocks into a corresponding row of processing units in the systolic processing array; wherein, after the input feature data cached in the v-th cache block has been loaded, a first load instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, where v is an integer greater than 1.
  • accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: after the input feature data corresponding to each first loading unit among the multiple first loading units has been loaded, sending a second loading instruction to the next first loading unit, to trigger the next first loading unit to load its input feature data into the systolic processing array.
  • the sending subunit includes a first parsing subunit and a second parsing subunit; sending a read instruction to the storage unit through the sending subunit includes: receiving a loading instruction through the first parsing subunit, and parsing the loading instruction to generate description information of the input feature data to be loaded; and parsing the description information through the second parsing subunit, and sending the read instruction to the storage unit according to the parsing result.
  • the description information includes at least one of the following: the number of groups of input feature data to be loaded, the group number of the first group of input feature data to be loaded, the number of groups of input feature data that can be processed simultaneously by the corresponding array block, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  • the cache subunit includes a first cache subunit and a second cache subunit; caching the input feature data returned in response to the read instruction through the cache subunit includes: generating auxiliary loading information, which determines the loading mode of the input feature data to be loaded; caching the auxiliary loading information through the first cache subunit; and, through the second cache subunit, reading the auxiliary loading information from the first cache subunit and caching, according to the auxiliary loading information, the input feature data returned in response to the read instruction.
  • reading the auxiliary loading information from the first cache subunit through the second cache subunit and caching the input feature data according to the auxiliary loading information further includes: caching, through a third cache subunit, the input feature data returned by the storage unit in response to the read instruction; reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging the input feature data cached by the third cache subunit according to the auxiliary loading information, and writing the rearranged input feature data into a fourth cache subunit; and caching the rearranged input feature data through the fourth cache subunit, so that the loading subunit loads the rearranged input feature data into the systolic processing array.
  • the number of fourth cache subunits is multiple; caching the rearranged input feature data through the fourth cache subunit includes: sequentially caching the rearranged input feature data through the multiple fourth cache subunits; and loading the rearranged input feature data into the systolic processing array through the loading subunit includes: sequentially loading, through the loading subunit, the input feature data in each of the multiple fourth cache subunits into the systolic processing array.
  • when the caching of the rearranged input feature data corresponding to the previous fourth cache subunit among the multiple fourth cache subunits is completed, caching of the rearranged input feature data corresponding to the current fourth cache subunit among the multiple fourth cache subunits begins.
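A minimal software sketch of this sequential double-buffering (assuming, purely for illustration, exactly two fourth cache subunits used alternately):

```python
# Hypothetical ping-pong schedule for two "fourth cache subunits": at
# each step one subunit is filled with rearranged data while the
# previously filled one is drained into the systolic array.
def ping_pong_schedule(num_chunks):
    schedule = []
    for step in range(num_chunks):
        fill = step % 2                               # subunit cached now
        drain = (step - 1) % 2 if step > 0 else None  # subunit loaded now
        schedule.append((fill, drain))
    return schedule

assert ping_pong_schedule(3) == [(0, None), (1, 0), (0, 1)]
```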
  • each fourth cache subunit of the multiple fourth cache subunits includes a plurality of fifth cache subunits; among the fifth cache subunits included in a fourth cache subunit, the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  • the method further includes: in the process of rearranging the input feature data cached by the third cache subunit, performing second padding on the input feature data cached by the third cache subunit.
  • performing the second padding on the input feature data cached by the third cache subunit includes: padding at least one of the upper boundary and the lower boundary of the input feature data cached by the third cache subunit.
  • the auxiliary loading information includes at least one of the following: a sliding step size of a convolution kernel parameter in the column direction of the input feature data, the height of a dilated (atrous) convolution kernel, and a padding parameter.
  • the systolic processing array includes a plurality of array blocks, each of which is used to process a group of input feature data; loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading the input feature data into at least one array block through each of the plurality of first loading units, respectively.
  • each first loading unit of the plurality of first loading units includes a cache subunit, and the depth of the cache subunit is equal to the depth of the array block loaded by that first loading unit; caching the read input feature data includes: caching, through the cache subunit, the input feature data read from the storage unit.
  • the heights of the array blocks are all equal.
  • the size of an array block is equal to the size of a convolution kernel parameter loaded in the systolic processing array.
  • the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  • the size of the systolic processing array is an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array.
  • the number of columns of the systolic processing array is an integer multiple of three.
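A small arithmetic illustration of these sizing constraints (the 3×3 kernel and the 12×12 array are assumed example values, not taken from the patent):

```python
# With 3x3 convolution kernel parameters and a systolic array whose size
# is an integer multiple of the kernel size, a 12x12 array partitions
# into (12 // 3) * (12 // 3) = 16 kernel-sized array blocks, and its
# column count is an integer multiple of three.
kernel_h, kernel_w = 3, 3
array_h, array_w = 12, 12
assert array_w % 3 == 0
num_blocks = (array_h // kernel_h) * (array_w // kernel_w)
assert num_blocks == 16
```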
  • the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which they load data into the (u+1)-th row of processing units, where u is a positive integer.
  • the specific embodiment of the first loading unit in the above method embodiment is the same as the embodiment of the first loading unit 602 in the foregoing data processing apparatus, and details are not described herein again.
  • embodiments of the present disclosure further provide a data processing apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in any of the embodiments.
  • the above-mentioned data processing apparatus may be a data processing chip, for example, a system-on-chip (SoC).
  • the embodiments of the present specification further provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in any of the foregoing embodiments.
  • computer-readable media include persistent and non-persistent, removable and non-removable media, and information storage may be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory computer-readable media, such as modulated data signals and carrier waves.
  • a typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet, wearable device, or a combination of any of these devices.


Abstract

Embodiments of the present disclosure provide a data processing method, apparatus, and system, and a neural network accelerator, used for loading input feature data into a systolic processing array. The apparatus comprises a plurality of first load cells, each of the plurality of first load cells configured to: access a memory cell in parallel to read the input feature data from the memory cell; cache the read input feature data; and load the cached input feature data into at least one row of processing cells in the systolic processing array.

Description

Data Processing Apparatus, Method and System, and Neural Network Accelerator

Technical Field
The present disclosure relates to the technical field of artificial intelligence, and in particular, to a data processing apparatus, method and system, and a neural network accelerator.
Background
Some neural networks contain many convolution operations. When performing convolution processing, the input feature data and convolution kernel parameters need to be loaded into the systolic processing array in the neural network, and the systolic processing array then computes on the input feature data and the convolution kernel parameters to obtain the output feature data. With the traditional data loading method, the processing units in the systolic processing array are often idle, so that the processing efficiency of the systolic processing array is low.
SUMMARY OF THE INVENTION
In view of this, the embodiments of the present disclosure propose a data processing method, apparatus and system, and a neural network accelerator, to solve the technical problem of low processing efficiency of a systolic processing array in the related art.
According to a first aspect of the embodiments of the present disclosure, a data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus includes a plurality of first loading units, each of which is configured to: access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
According to a second aspect of the embodiments of the present disclosure, a data processing system is provided, including the data processing apparatus described in any one of the embodiments, and a systolic processing array for loading the input feature data and the convolution kernel parameters and performing convolution processing on the input feature data and the convolution kernel parameters.
According to a third aspect of the embodiments of the present disclosure, a neural network accelerator is provided, which includes the data processing apparatus described in any embodiment, or the data processing system described in any embodiment.
According to a fourth aspect of the embodiments of the present disclosure, a data processing method is provided, applied to each first loading unit in a data processing apparatus including a plurality of first loading units, to load input feature data into a systolic processing array. The method includes: accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit; caching the read input feature data; and loading the cached input feature data into at least one row of processing units in the systolic processing array.
According to a fifth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method described in any embodiment of the present disclosure is implemented.
According to a sixth aspect of the embodiments of the present disclosure, a data processing apparatus is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any first loading unit in the method of any embodiment of the present disclosure.
By applying the solutions of the embodiments of the present disclosure, multiple first loading units acquire input feature data from the storage unit in parallel. Compared with acquiring data through only one loading unit, the amount of valid data acquired by the systolic processing array is multiplied, which reduces the idling of the processing units in the systolic processing array and improves the processing efficiency of the systolic processing array.
Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present disclosure more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the processing manner of a systolic processing array according to an embodiment of the present disclosure.

FIG. 2 is a schematic diagram of a data loading manner according to an embodiment of the present disclosure.

FIG. 3 is a schematic diagram of the data flow process in a systolic processing array according to an embodiment of the present disclosure.

FIG. 4 is a schematic diagram of a data reading and loading process according to an embodiment of the present disclosure.

FIGS. 5A and 5B are schematic diagrams of valid data read from a storage unit according to an embodiment of the present disclosure.

FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.

FIGS. 7A and 7B are schematic diagrams of changes in the amount of cached data during a data loading process according to an embodiment of the present disclosure.

FIG. 8 is a schematic structural diagram of a first loading unit according to an embodiment of the present disclosure.

FIG. 9 is a schematic diagram of instantiating multiple cache units according to an embodiment of the present disclosure.

FIG. 10 is a processing flowchart of a data processing apparatus according to an embodiment of the present disclosure.

FIG. 11A is a schematic diagram of the length of a conventional systolic processing array.

FIG. 11B is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure.

FIG. 12 is a schematic diagram of a computer device according to an embodiment of the present disclosure.

FIG. 13 is a schematic diagram of a storage unit according to an embodiment of the present disclosure.

FIG. 14 is a schematic diagram of a data processing system according to an embodiment of the present disclosure.

FIGS. 15A and 15B are schematic diagrams of neural network accelerators according to embodiments of the present disclosure.

FIG. 16 is a flowchart of a data processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "the," and "said" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various pieces of information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information, without departing from the scope of the present disclosure. Depending on the context, the word "if" as used herein can be interpreted as "at the time of," "when," or "in response to determining."
According to an embodiment of the present disclosure, a data processing apparatus is provided for loading input feature data into a systolic processing array. The apparatus includes a plurality of first loading units, each of which can access a storage unit in parallel to read input feature data from the storage unit, cache the read input feature data, and load the cached input feature data into at least one row of processing units in the systolic processing array.
Neural networks such as convolutional neural networks (CNNs) often include many convolution operations, that is, input feature data is convolved with convolution kernel parameters (also called weight data) to obtain output feature data. The input feature data here can be an input feature map (IFM), which can come from images, speech or text; correspondingly, the output feature data is an output feature map (OFM), which can be converted into an image, speech or text. When performing convolution processing, the input feature data and the convolution kernel parameters need to be loaded into the systolic processing array in the neural network, and the systolic processing array then performs convolution processing on them to obtain the output feature data. A systolic processing array is a simple and efficient processing device; for a weight-stationary systolic processing array, the input feature data is pipelined and reused within the array, which reduces the input bandwidth requirement for the input feature data.
The working principle of a systolic processing array with fixed convolution kernel parameters is shown in FIG. 1. A common systolic processing array is rectangular, including R×H processing units, that is, each row has H processing units and each column has R processing units; R and H may or may not be equal. The larger the size of the systolic processing array, the greater the bandwidth requirement for the input feature data. The processing unit in the i-th row and j-th column is used to pass the input feature data obtained in the current clock cycle to the processing unit in the i-th row and (j+1)-th column in the next clock cycle, and to send its operation result to the processing unit in the (i+1)-th row and j-th column, so that in the next clock cycle the latter can accumulate the operation result of the processing unit in the i-th row and j-th column with its own operation result. It should be noted that a processing unit does not necessarily process every piece of input feature data it receives. For example, after the processing unit in row 1, column 1 passes the input feature data of row 1, column 1 to the processing unit in row 1, column 2, the latter does not multiply that input feature data with the convolution kernel parameter it holds, but passes it directly to the processing unit in row 1, column 3. Each processing unit can multiply and accumulate the input feature data loaded into it with its convolution kernel parameter and output the result to the next processing unit; the operation results of the last row of processing units serve as the output feature data. One processing unit can be loaded with one input feature datum and one convolution kernel parameter; alternatively, a processing unit can be loaded with an input feature data block of size M×N and a convolution kernel parameter block of size K×L. R, H, M, N, K and L are all positive integers.
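The multiply-accumulate behavior described above can be sketched in a much-simplified software model (per-cycle timing and the data skew are abstracted away; this illustrates what a weight-stationary column of processing units computes, not how the patented hardware schedules it):

```python
# Simplified model of a weight-stationary systolic column performing a
# 1-D convolution: each processing unit holds one kernel weight, and the
# partial sum handed down from the unit above is accumulated with the
# local product. Cycle-accurate timing is deliberately omitted.
def systolic_column_conv(inputs, weights):
    n_out = len(inputs) - len(weights) + 1
    outputs = []
    for o in range(n_out):
        psum = 0                       # partial sum entering the top unit
        for r, w in enumerate(weights):
            psum += w * inputs[o + r]  # unit r adds its product
        outputs.append(psum)           # result leaves the bottom unit
    return outputs

# Matches a direct sliding-window convolution.
assert systolic_column_conv([1, 2, 3, 4, 5], [1, 0, -1]) == [-2, -2, -2]
```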
The first loading unit is used to load the input feature data into the corresponding processing units. The input feature data can be stored in a storage unit, which can be a static random-access memory (SRAM) or another type of storage unit. Since the processing result of a processing unit is passed downward and accumulated, in the processing unit below, with that unit's own product, the input feature data of each row of processing units is delayed by one clock cycle per row; in this way, when a row's product of input feature data and convolution kernel parameter has been computed, it is accumulated exactly with the partial sum passed down from the previous row of processing units. If the input of the input feature data stops, the computation of at least one processing unit in the systolic processing array at the corresponding moment also stops.
The second loading unit is configured to load convolution kernel parameters into the corresponding processing units. The convolution kernel parameters may be stored in a storage unit; the storage unit storing the convolution kernel parameters and the storage unit storing the input feature data may be the same storage unit or different storage units. During a given computation cycle, the convolution kernel parameters in a processing unit remain unchanged and are reused for the different input feature data flowing into that processing unit. The loading order of the convolution kernel parameters and the input feature data is not restricted here: the convolution kernel parameters may be loaded before the input feature data, after it, or at the same time. In addition, different processing units may load their convolution kernel parameters simultaneously, or sequentially in a certain order, for example, in the same manner as the input feature data is loaded into the systolic processing array. The output unit is configured to buffer the output data of the processing units, or to send the processing results (i.e., the output feature data) to other processing units for further processing or storage.
Taking the loading of input feature maps and convolution kernel parameters as an example, the first loading unit may load one input feature map into the systolic processing array at a time; to improve processing efficiency, it may also load multiple input feature maps at a time. Similarly, the second loading unit may load one set of convolution kernel parameters into the systolic processing array at a time, or multiple sets at a time. As shown in FIG. 2, assuming that each set of convolution kernel parameters has a size of 3×3 and the processing unit in row i, column j is denoted Mij, the second loading unit can load four sets of convolution kernel parameters into the systolic processing array (squares of different colors in the figure represent different convolution kernel parameters). Assuming that each set of convolution kernels corresponds to three independent rows of data (row_x, row_y, row_z) in an input feature map, the first loading unit can feed (row_x, row_y, row_z) of the first input feature map into the first three rows of processing units of the systolic processing array, and (row_x, row_y, row_z) of the second input feature map into the second three rows of processing units, where the first three rows precede the second three rows in processing order. In this way, two input feature maps can be processed simultaneously.
Assume the height of the systolic processing array is H, i.e., the array has H rows; then at most H new data values flow into the array per clock cycle. In other words, when the systolic processing array processes input feature data at full speed, its maximum demand rate is H values per clock cycle, i.e., in every clock cycle one input feature datum can be fed to each row of the array. For example, denote the input feature datum in row i, column j as aij, and let H equal 5. As shown in FIG. 3, in the 1st clock cycle, a11 is input into M11. In the 2nd clock cycle, a11 is passed rightward from M11 to M12, the new datum a12 is input into M11, and a21 is input into M21. In the 3rd clock cycle, a11 is passed rightward to M13, a12 is passed rightward to M12, the new datum a13 is input into M11, a21 is passed rightward to M22, the new datum a22 is input into M21, and the new datum a31 is input into M31. The data in the gray boxes in the figure are the new data flowing into the systolic processing array in each clock cycle. By analogy, from the 5th clock cycle onward, 5 new values are input into the first column of processing units in every cycle, so the maximum data demand rate of the systolic processing array is 5.
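The staggered timing just described can be sketched as follows (an illustrative model, not from the patent): at clock cycle t, row i of the array receives element a[i][t−i+1], so each row starts one cycle after the row above it, and from cycle H onward H new values enter per cycle.

```python
# Illustrative sketch of the staggered input schedule: which (row, column)
# indices of the input feature data enter column 1 of the array at cycle t.
def new_inputs(t, H):
    """1-based (row, column) indices of data entering the array at cycle t."""
    return [(i, t - i + 1) for i in range(1, min(t, H) + 1)]

H = 5
assert new_inputs(1, H) == [(1, 1)]                   # cycle 1: a11 into M11
assert new_inputs(3, H) == [(1, 3), (2, 2), (3, 1)]   # cycle 3: a13, a22, a31
assert len(new_inputs(5, H)) == 5                     # full rate from cycle H on
```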
As shown in FIG. 4, input feature data (e.g., an input feature map) may be stored in a storage unit; the first loading unit reads the input feature data from a data output port of the storage unit and loads it into the systolic processing array. For example, the portion of the input feature map read from one data output port in each clock cycle is loaded into one row of the systolic processing array. For some storage units, for example on-chip SRAM, the minimum access unit is generally 2 to the n-th power bytes, where n is an integer; that is, at least 2^n bytes of data are accessed each time. Assume that the bit width of the SRAM's data output port equals its minimum access unit, i.e., input feature data can be accessed in the SRAM in units of the SRAM's minimum access unit. For ease of description, the technical solutions of the embodiments of the present disclosure are described below taking each datum to be 1 byte (i.e., 8 bits) long; in practical applications, a datum may also be longer than 1 byte, and the situation for other data lengths is similar to that for 1-byte data, so this disclosure does not elaborate on it further. Now consider the following cases:
Case 1: when the number of bits corresponding to the column count of the input feature map stored in the SRAM is greater than the amount of data in the SRAM's minimum access unit and is not an integer multiple of that amount, the amount of data the data processing apparatus can access from the SRAM in one clock cycle is less than the amount of data in one row of the stored input feature map, and after one access operation on a row of the stored input feature map, the access operation for the data at the end of that row becomes less efficient. As shown in FIG. 5A, assume that a row of input feature data has 40 columns (i.e., 40 bytes per row), each input feature datum in the row is 1 byte, and the minimum access unit is 32 bytes. Then accessing the SRAM in the 1st clock cycle retrieves 32 bytes of the input feature data; when accessing the SRAM in the 2nd clock cycle, only the last 8 bytes of the row remain, so although 32 bytes can still be accessed, only 8 of them are valid data.
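The arithmetic behind Case 1 can be sketched as follows (an illustrative helper, not part of the patent): reading one row in minimum-access-unit chunks, the final access carries only the row's leftover bytes as valid payload.

```python
# Illustrative sketch of Case 1: valid bytes returned by each successive
# access when reading one row whose length is not a multiple of the
# minimum access unit.
def valid_bytes_per_access(row_bytes, access_unit):
    """Valid payload of each access needed to read one row."""
    out, remaining = [], row_bytes
    while remaining > 0:
        out.append(min(remaining, access_unit))   # last chunk may be partial
        remaining -= access_unit
    return out

assert valid_bytes_per_access(40, 32) == [32, 8]    # FIG. 5A example
assert valid_bytes_per_access(64, 32) == [32, 32]   # fully efficient multiple
```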
Case 2: the column count of the input feature map stored in the SRAM is relatively small, so one minimum access unit may span multiple rows of input feature data. When only one row of input feature data needs to be accessed from the SRAM, multiple rows may be retrieved in a single access operation, but only one of them is valid data. As shown in FIG. 5B, the white rectangles represent rows 1, 3 and 5 of the input feature data, and the gray rectangles represent rows 2, 4 and 6. Assume that a row of input feature data has 32 columns, each input feature datum in the row is 1 byte, and the minimum access unit is 64 bytes; then 64 bytes can be retrieved from the SRAM per clock cycle (for example, row 1 and row 2 together). However, since one row of input feature data is loaded into the systolic processing array at a time, only row 1 or row 2 is valid data, so the valid data amounts to 32 bytes.
Assume that the amount of data corresponding to the height H of the systolic processing array exactly equals the amount of data in the minimum access unit. Then in the first case, only the data provided in odd-numbered clock cycles can satisfy the array's maximum data consumption rate; the data provided in even-numbered clock cycles falls short of it, so the systolic processing array cannot obtain enough valid data in any even-numbered cycle. In the second case, the data provided in every clock cycle falls short of the array's maximum data consumption rate, i.e., the systolic processing array can never obtain enough valid data in a cycle. Furthermore, when the height H of the systolic processing array is large (greater than the minimum access unit), this phenomenon is even more pronounced.
It can be seen that in both of the above cases, because the amount of valid data obtained per clock cycle is reduced, H data values cannot be loaded into the systolic processing array every clock cycle, leaving some processing units in the array idle (no data flowing in) and lowering processing efficiency. Since the systolic processing array occupies most of the area of a neural network accelerator chip, a drop in the array's efficiency means inefficiency of the whole chip and directly reduces the chip's cost-effectiveness. Traditional data loading schemes often use a single systolic processing array to process input feature data of various sizes; as a result, when the array's height does not match the SRAM's data output port (i.e., the amount of data corresponding to the array's height differs from the amount of data output by the SRAM's data output port), there is no guarantee that the amount of input feature data supplied to the array per clock cycle comes close to the array height H.
Based on this, an embodiment of the present disclosure provides a data processing apparatus for loading input feature data into a systolic processing array. As shown in FIG. 6, the apparatus may include:
多个第一装载单元602,所述多个第一装载单元602中每个第一装载单元用于:A plurality of first loading units 602, each first loading unit of the plurality of first loading units 602 is used for:
并行地访问存储单元,以从所述存储单元读取输入特征数据,accessing memory cells in parallel to read input feature data from the memory cells,
对读取的所述输入特征数据进行缓存,以及buffering the read input feature data, and
将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元。The buffered input feature data is loaded into at least one row of processing elements in a systolic processing array.
In one embodiment, the size of the systolic processing array may be determined according to the size of the convolution kernel parameters to be loaded into it, i.e., an array of suitable size is selected based on the size of the convolution kernel parameters. To improve the utilization of processing units in the systolic processing array, the array size may be an integer multiple of the size of one set of convolution kernel parameters loaded into it. For example, if the convolution kernel parameters loaded into the systolic processing array are all K×L, the number of rows of the array is an integer multiple of K and the number of columns is an integer multiple of L.
In the embodiments of the present disclosure, the systolic processing array can be divided along the "row" dimension into several array blocks, each array block including one or more rows of processing units, and the array blocks may all have the same height. Optionally, the size of one array block equals the size of one set of convolution kernel parameters loaded into the systolic processing array; that is, one array block holds one set of convolution kernel parameters. Alternatively, the size of one array block may equal the combined size of multiple sets of convolution kernel parameters loaded into the array. For example, in FIG. 6, every 3 rows of processing units in the systolic processing array form one array block, so the array is divided into two array blocks, array block 1 and array block 2, where each small square represents a processing unit.
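The row-wise partitioning above can be sketched as follows (an illustrative helper, not from the patent): an array of H rows is split into blocks of a fixed number of rows, e.g. the kernel height K, so that each block can be served by its own first loading unit.

```python
# Illustrative sketch of partitioning an H-row systolic array into array
# blocks of `block_rows` rows each (0-based row indices).
def partition_rows(H, block_rows):
    assert H % block_rows == 0, "array height must be a multiple of block height"
    return [list(range(start, start + block_rows))
            for start in range(0, H, block_rows)]

# The FIG. 6 example: 6 rows split into two 3-row array blocks.
assert partition_rows(6, 3) == [[0, 1, 2], [3, 4, 5]]
```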
Each first loading unit can load input feature data into one array block of the systolic processing array, i.e., each array block independently uses one first loading unit. In one embodiment, the size of the input feature data loaded into an array block by the first loading unit in each clock cycle equals the amount of data corresponding to the height of one array block. Each first loading unit of the plurality of first loading units 602 may be connected to a first interface, which may be any of various types of interfaces, for example, an application programming interface (API), a remote procedure call (RPC) interface, or a remote method invocation (RMI) interface. Each first loading unit of the plurality of first loading units 602 can access, in parallel through its own first interface, the storage unit that stores the input feature data. The storage unit may be an SRAM or another type of storage device.
Optionally, the storage unit may include a plurality of data output interfaces, and the first interface of each first loading unit communicates with one data output interface of the storage unit to obtain input feature data from the corresponding data output interface. To facilitate communication, the data output interface may be of the same type as the first interface. Optionally, the storage unit may instead include only one data output interface, with the first interface of every first loading unit communicating with that same data output interface to obtain input feature data from it.
The moment at which the plurality of first loading units 602 load input feature data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the moment at which the plurality of first loading units 602 load input feature data into the (u+1)-th row, where u is a positive integer. In one embodiment, the moment at which one first loading unit of the plurality of first loading units 602 loads input feature data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the moment at which another first loading unit loads input feature data into the (u+1)-th row, where u is a positive integer.
The plurality of first loading units 602 can communicate with one another so that the output timing of the input feature data is kept on a diagonal: each row of the systolic processing array receives its input feature data one clock cycle later than the previous row, so that when the product of the current row's input feature data and its convolution kernel parameters has been computed, it is accumulated exactly with the product passed down from the previous row. In some embodiments, each first loading unit of the plurality of first loading units 602 may be provided with a second interface, through which adjacent first loading units communicate. A preceding first loading unit (for example, the k-th first loading unit) may, upon completing its own loading of data into the systolic processing array, send a synchronization signal through the second interface to the next first loading unit (for example, the (k+1)-th first loading unit), and the next first loading unit starts loading data into the systolic processing array after receiving the synchronization signal. The type of the second interface may be the same as or different from that of the first interface.
In some embodiments, the sum of the buffering rates of the input feature data is greater than or equal to the loading rate of the systolic processing array. For example, the sum of the buffering rates of the input feature data in each clock cycle may be greater than or equal to the array's loading rate in that clock cycle; or the sum of the average buffering rates of the input feature data may be greater than or equal to the array's average loading rate; or the sum of the average buffering rates may be greater than or equal to the array's maximum loading rate. Here, the average buffering rate is the ratio of the total amount of input feature data buffered in all clock cycles up to and including the current one to the number of those clock cycles, and the average loading rate is the ratio of the total amount of input feature data loaded in all clock cycles up to and including the current one to the number of those clock cycles. For example, if the current clock cycle is the 2nd, the input feature data buffered in the 1st clock cycle is 4 bytes, and the input feature data buffered in the 2nd clock cycle is 2 bytes, the average buffering rate is (4+2)/2 = 3 bytes per clock cycle. Since the rate of data buffering is at least the rate of data consumption, the systolic processing array is guaranteed always to obtain enough data. In other embodiments, each first loading unit loads input feature data into the systolic processing array only after its own buffering is complete. In either way, the first loading units act as a data pool that aggregates the input feature data and then outputs it steadily to the systolic processing array.
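The averages defined above can be written out as a small helper (illustrative only): total bytes over all cycles so far, divided by the number of cycles.

```python
# Illustrative sketch of the average buffering/loading rate definition:
# sum of per-cycle byte counts divided by the number of clock cycles.
def average_rate(per_cycle_bytes):
    return sum(per_cycle_bytes) / len(per_cycle_bytes)

# The worked example above: 4 bytes in cycle 1, 2 bytes in cycle 2.
assert average_rate([4, 2]) == 3.0   # bytes per clock cycle
```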
Each data output interface of the storage unit can, within one clock cycle, output to the first interface of a first loading unit some or all of the input feature data located in the same row. For example, data output interface 1 of the storage unit outputs the input feature data of row 1 to first interface 1 of a first loading unit in the first clock cycle, data output interface 2 of the storage unit outputs the input feature data of row 2 to first interface 2 in the first clock cycle, and so on.
Each first loading unit may include at least one cache subunit for buffering the input feature data obtained from the storage unit. When a first loading unit includes multiple cache subunits, each of the cache subunits may, in turn and in a certain order, obtain input feature data from the storage unit, buffer it, and load the buffered input feature data into at least one row of processing units of the systolic processing array. Each cache subunit can begin buffering its input feature data once the previous cache subunit has finished buffering, and the loading of the preceding cache subunit in the loading order proceeds in parallel with the buffering of the following one. For example, when there are 2 cache subunits: in the first round, data is buffered in cache subunit 1; once cache subunit 1 is full, its data is loaded into the systolic processing array while data is buffered in cache subunit 2; once cache subunit 2 is full, its data is loaded into the array while cache subunit 1 buffers again, and so on in alternation. When the number of cache subunits is greater than 2, buffering and loading proceed similarly and are not described again here.
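The alternating (ping-pong) scheme above can be sketched as a schedule (an illustrative model, not from the patent): at each step, one block of data is being buffered while the previously buffered block is being loaded into the array.

```python
# Illustrative sketch of the two-subunit ping-pong schedule: per step,
# which data block is being buffered (fill) and which is being loaded (drain).
def ping_pong_schedule(num_blocks):
    """Return (fill, drain) block indices per step; None means idle."""
    steps = []
    for t in range(num_blocks + 1):
        fill = t if t < num_blocks else None     # block being buffered now
        drain = t - 1 if t >= 1 else None        # previously buffered block
        steps.append((fill, drain))
    return steps

# Two blocks: buffer 0; then buffer 1 while loading 0; then load 1.
assert ping_pong_schedule(2) == [(0, None), (1, 0), (None, 1)]
```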
A cache subunit may include a plurality of cache blocks, each of which is used to buffer the input feature data required by one row of processing units in the systolic processing array, and a loading subunit is configured to load the input feature data buffered in each of the cache blocks into the corresponding row of processing units in the array. After the input feature data buffered in the v-th cache block has been loaded, a first loading instruction can be sent to the (v+1)-th cache block so that the loading subunit begins loading the input feature data buffered in the (v+1)-th cache block, where v is an integer greater than 1.
After a first loading unit finishes buffering input feature data into the corresponding cache subunit, it can load the input feature data buffered in that cache subunit into the systolic processing array. During loading of the input feature data, each first loading unit reads one input feature datum from its corresponding cache subunit per clock cycle and loads it into the systolic processing array. Because multiple first loading units obtain input feature data from the storage unit in parallel, the amount of data obtained per clock cycle is multiplied compared with obtaining data through a single loading unit; the amount of valid data obtained by the systolic processing array is likewise multiplied, reducing the idling of processing units in the array and improving its processing efficiency. The two cases described above are now analyzed in turn to illustrate the technical effects of the embodiments of the present disclosure:
For Case 1, assume that a row of input feature data has 40 columns, each input feature datum in the row is 1 byte, and the minimum access unit is 32 bytes, i.e., the byte count of a row of input feature data exceeds the byte count of the SRAM's minimum access unit. As shown in FIG. 7A, initially no valid data is buffered in either loading unit. In the 1st clock cycle, each first loading unit can buffer 32 new bytes of valid data; therefore, in step S702 the two first loading units buffer a total of 64 bytes of valid data in the 1st clock cycle. In the 2nd clock cycle, each first loading unit can buffer 8 new bytes of valid data, so the two first loading units add a total of 16 bytes; in step S706 the two first loading units hold a total of 48 bytes of valid data in the 2nd clock cycle. Assume that the height H of the systolic processing array equals the quantity corresponding to the minimum access unit (i.e., H = 32), and that H input feature data are consumed in each of the 1st and 2nd clock cycles. Then in step S702, corresponding to the 1st clock cycle, 32 bytes of input feature data are taken from the buffers and loaded into the systolic processing array, and in step S704 the two first loading units retain the remaining 32 bytes of input feature data; in step S706, corresponding to the 2nd clock cycle, the two first loading units take another 32 bytes of input feature data from the buffers and load them into the array, and in step S708 they retain the remaining 16 bytes. Therefore, the solution of the embodiments of the present disclosure can satisfy the data loading requirements of the systolic processing array.
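The bookkeeping for Case 1 can be reproduced with a short simulation (illustrative only; function and parameter names are not from the patent): two loading units each fetch one 40-byte row in 32-byte accesses while the array consumes H = 32 bytes per cycle, and the buffered data never runs dry.

```python
# Illustrative simulation of Case 1: per-cycle (bytes loaded, bytes left
# buffered) for two loading units fetching 40-byte rows with a 32-byte
# minimum access unit, against a demand of H = 32 bytes per cycle.
def simulate(cycles, units=2, row_bytes=40, access_unit=32, demand=32):
    buffered = 0
    remaining = [row_bytes] * units        # unread bytes of each unit's row
    levels = []
    for _ in range(cycles):
        for i in range(units):             # each unit performs one access
            got = min(remaining[i], access_unit)   # valid bytes this access
            buffered += got
            remaining[i] -= got
        loaded = min(buffered, demand)     # array drains up to H bytes
        buffered -= loaded
        levels.append((loaded, buffered))
    return levels

# Cycle 1: 64 cached, 32 loaded, 32 left; cycle 2: +16 cached, 32 loaded, 16 left.
assert simulate(2) == [(32, 32), (32, 16)]
```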
In the second case, the number of columns of the input feature data is relatively small. As shown in FIG. 7B, assume that the input feature data has 32 columns, each input feature datum occupies 1 byte, the minimum access unit is 64 bytes, and there are two first loading units. Since only 32 of the 64 bytes returned by each access are valid, each first loading unit can newly cache 32 bytes of valid data per clock cycle. Initially, no valid data is loaded in either loading unit. In the first clock cycle, each first loading unit newly caches 32 bytes of valid data; therefore, in step S712, the two first loading units together cache 64 bytes of valid data. In step S716, the two loading units together cache another 64 bytes of valid data in the second clock cycle. Assume that the height H of the systolic processing array equals the number of data elements in the minimum access unit (that is, H = 64), and that H input feature data are consumed in each of the first and second clock cycles. In step S712, corresponding to the first clock cycle, 64 bytes of input feature data are taken out of the cache and loaded into the systolic processing array, so in step S714 the two first loading units hold no cached data. In step S716, corresponding to the second clock cycle, the two first loading units again take 64 bytes of input feature data out of the cache and load them into the systolic processing array, so in step S718 they again hold no cached data. It can be seen that the caching rate of the two first loading units equals the data consumption rate of the systolic processing array; therefore, the systolic processing array can always obtain enough input feature data.
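The rate matching in the two cases above can be illustrated with a short simulation sketch. This is illustrative only (the function name and cycle model are assumptions, not part of the disclosed hardware): it counts the cycles in which the cached valid data cannot cover the array's demand.

```python
def simulate_loading(num_loaders, valid_bytes_per_cycle, demand_per_cycle, cycles):
    """Count cycles in which the cached valid data falls short of the
    systolic array's per-cycle demand.

    Each of `num_loaders` first loading units adds `valid_bytes_per_cycle`
    bytes of valid data to the shared backlog every cycle; the array tries
    to consume `demand_per_cycle` bytes per cycle.
    """
    backlog = 0
    shortfalls = 0
    for _ in range(cycles):
        backlog += num_loaders * valid_bytes_per_cycle
        if backlog >= demand_per_cycle:
            backlog -= demand_per_cycle
        else:
            shortfalls += 1  # array starves this cycle
            backlog = 0
    return shortfalls

# Case two: 32 valid bytes per loader per cycle, array height H = 64.
# Two loaders exactly match the consumption rate; one loader always starves.
```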
It can be seen that, because multiple first loading units 602 are used, the amount of input feature data cached per clock cycle is multiplied, so the cached data can always supply enough data for loading, which improves the processing efficiency of the systolic processing array. Especially when the systolic processing array is tall, data can be fed to the array efficiently in this way, so that the amount of data loaded per clock cycle is as close as possible to the height of the array. The height dimension of the systolic processing array can therefore be scaled without restriction, which facilitates the flexible design of high-performance systolic processing arrays of different sizes.
Because the number of columns of an input feature map is generally large while the caching capacity of a first loading unit is limited, in practical applications the input feature map may be partitioned into multiple data blocks. The number of columns of each data block is smaller than the total number of columns of the input feature data, and the number of rows of each data block is less than or equal to the total number of rows of the input feature data. In this case, only one data block is cached and loaded at a time, and the next data block is cached after the current data block has been fully loaded. Optionally, the number of columns of a data block may equal the number of data elements in the minimum access unit of the storage unit. For example, if the minimum access unit of the storage unit is 32 bytes and each datum occupies 1 byte, a data block has 32 columns. Optionally, the number of columns of a data block may also equal an integer multiple of the number of data elements in the minimum access unit of the storage unit.
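The column-wise blocking described above can be sketched as follows (an illustrative helper, with an assumed name; the last block is allowed to be narrower when the width is not an exact multiple of the block size):

```python
def split_into_blocks(total_cols, block_cols):
    """Column ranges [start, end) of the data blocks a feature map is
    split into. `block_cols` is assumed to equal the number of data
    elements in the storage unit's minimum access unit (or an integer
    multiple of it)."""
    return [(c, min(c + block_cols, total_cols))
            for c in range(0, total_cols, block_cols)]

# A 100-column feature map with a 32-element access unit yields
# three full blocks and one narrower tail block.
blocks = split_into_blocks(100, 32)
```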
In some embodiments, each of the plurality of first loading units may further perform first padding on the input feature data while loading it. Since every first loading unit can perform padding in the same way, the padding performed by one of them (referred to as loading unit A) is described below as an example; the other first loading units operate likewise. Loading unit A may first obtain description information of the input feature data. The description information may include information indicating whether the input feature data needs padding; for example, "0" or a null value indicates that no padding is required, and "1" indicates that padding is required. When padding is required, the description information may further include padding parameters, for example, the value of the padding data and the number of rows and/or columns of padding data.
Loading unit A may determine from the description information whether the input feature data needs padding. If so, during loading it may first determine whether the datum currently being loaded is input feature data or padding data: input feature data is read directly from the cache and loaded, whereas padding data is generated from the padding parameters and then loaded. In practical applications, the first padding may include padding at least one of the left boundary and the right boundary of the input feature data. Padding the left boundary means adding at least one column of padding data before the first column of the input feature data; padding the right boundary means adding at least one column of padding data after the last column of the input feature data.
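The on-the-fly left/right padding can be pictured with a minimal sketch (the function name and element representation are assumptions; in hardware the padding values are generated rather than stored):

```python
def pad_row(row, pad_l, pad_r, value=0):
    """Emit one row of input feature data with left- and right-boundary
    padding. Real data comes from the cache (`row`); padding elements
    are generated from the padding parameters (`value`)."""
    return [value] * pad_l + list(row) + [value] * pad_r

# One cached row of three elements, padded with 1 column on the left
# and 2 columns on the right before being sent to the array.
padded = pad_row([1, 2, 3], pad_l=1, pad_r=2)
```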
In some embodiments, the first loading unit includes: a sending subunit configured to send a read instruction to the storage unit; a cache subunit configured to cache the input feature data returned by the storage unit in response to the read instruction; and a loading subunit configured to load the cached input feature data into at least one row of processing units in the systolic processing array.
The sending subunit may include a first parsing subunit configured to receive a loading instruction and parse it to generate description information of the input feature data to be loaded, and a second parsing subunit configured to parse the description information and send the read instruction to the storage unit according to the parsing result. In one embodiment, the first parsing subunit may receive the loading instruction from a controller. However, the present invention is not limited thereto; according to other embodiments of the present invention, the first parsing subunit may also receive the loading instruction from another apparatus or storage device. The description information required by one first loading unit may specifically include some or all of the following:
The number of groups of input feature data to be loaded (ifm_num), which indicates how many groups of input feature data the first loading unit is responsible for loading in total. When the input feature data is an input feature map, one input feature map is one group of input feature data.
The number of the first group of input feature data to be loaded (ifm_start), which indicates the identification of the first group of input feature data the first loading unit is responsible for loading. The numbers of the remaining groups it loads increase sequentially; for example, if ifm_start equals 1, the second group is numbered 2, the third group is numbered 3, and so on.
The number of groups of input feature data (ifm_num_perb) that can be processed simultaneously by the array block (generally comprising multiple rows of processing units) the first loading unit is responsible for loading. During convolution, the convolution results of multiple groups of input feature data and their corresponding convolution kernel parameters sometimes need to be accumulated, so the systolic processing array must be able to process multiple groups of input feature data at the same time. For example, an input feature map may include an R (red) channel map, a G (green) channel map, and a B (blue) channel map; three sets of convolution kernel parameters convolve the three channel maps respectively, and the three convolution results are accumulated. An array block of the systolic processing array can therefore be loaded with the three sets of convolution kernel parameters, and the first loading unit loads the corresponding three channel maps for those blocks. In this case, ifm_num_perb equals 3.
The number of groups of input feature data processed simultaneously by the entire systolic processing array (ifm_num_perb_total), that is, the sum of the ifm_num_perb values of all first loading units.
The height of the convolution kernel (kernel_h), which indicates the height of the convolution kernel parameters in the array block the first loading unit is responsible for loading. It should be noted that the convolution kernels in the array block loaded by one first loading unit may include all rows of a complete set of convolution kernel parameters, or only some rows of a complete kernel. For example, a set of convolution kernel parameters may be 3 × 3 while the array block a first loading unit loads has a height of 5; those 5 rows may then include three rows of the first set of kernel parameters and two rows of the second set. In some embodiments, the depth of the cache subunit included in the first loading unit may equal the depth of the array block the first loading unit is responsible for loading.
The starting position of the array block the first loading unit is responsible for loading (kernel_h_start), which indicates which row of the systolic processing array contains the first row of processing units the first loading unit loads.
The base address of the input feature data to be loaded (ifm_baddr), which indicates the first address, in the storage unit, of the input feature data the first loading unit is responsible for loading.
The storage size occupied by each group of input feature data to be loaded (ifm_len), used for addressing each group. For example, the address of the second group of input feature data is the sum of the base address and the length corresponding to ifm_len.
The width of the input feature data to be loaded (ifm_w), which indicates the total number of columns of one group of input feature data. When the width exceeds the minimum access unit of the storage unit, the same row of input feature data can be loaded in multiple accesses.
Depending on actual needs, the description information may include at least one of the above items and may also include other items, which are not detailed here. The description information lets the first loading unit know how to obtain the input feature data from the storage unit.
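The descriptor fields above, and the ifm_len addressing rule, can be collected into a small sketch. The field mnemonics come from the text; the dataclass packaging and the `group_addr` helper are illustrative assumptions, not the disclosed instruction format:

```python
from dataclasses import dataclass

@dataclass
class LoadDescriptor:
    """Hypothetical container for the description info one first
    loading unit parses from a loading instruction."""
    ifm_num: int         # number of input-feature groups to load
    ifm_start: int       # number of the first group
    ifm_num_perb: int    # groups one array block processes at once
    kernel_h: int        # convolution-kernel height
    kernel_h_start: int  # first array row this unit loads
    ifm_baddr: int       # base address in the storage unit
    ifm_len: int         # storage size of one group (for addressing)
    ifm_w: int           # columns per group

    def group_addr(self, n: int) -> int:
        # n-th group (0-based) sits ifm_len bytes after the previous one
        return self.ifm_baddr + n * self.ifm_len

d = LoadDescriptor(ifm_num=3, ifm_start=1, ifm_num_perb=3, kernel_h=3,
                   kernel_h_start=0, ifm_baddr=0x1000, ifm_len=0x400, ifm_w=32)
```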
In some embodiments, the second parsing subunit is further configured to generate auxiliary loading information, which is used to determine how the input feature data is loaded. The auxiliary loading information may include some or all of the following:
The sliding stride of the convolution kernel parameters in the column direction of the input feature data the first loading unit loads (stride_h), which indicates how many rows the convolution kernel slides down over the input feature data each time during convolution: when stride_h equals 1 it slides down 1 row, when stride_h equals 2 it slides down 2 rows, and so on. This parameter lets the first loading unit know which rows of the input feature data need to be loaded into the systolic processing array.
The height parameter of the dilated convolution kernel (dilate_h), which indicates the row spacing of the input feature data convolved with the convolution kernel parameters. For example, with a 3 × 3 kernel, if dilate_h is 1 the spacing is 1, that is, consecutive rows of the input feature data (for example, rows 1, 2, and 3) are convolved with the 3 rows of kernel parameters; if dilate_h is 2 the spacing is 2, that is, alternate rows of the input feature data (for example, rows 1, 3, and 5) are convolved with the 3 rows of kernel parameters.
The height of the input feature data (ifm_h), which indicates the total number of rows of one group of input feature data.
The height of the output feature data (ofm_h), which indicates the total number of rows of one group of output feature data.
The number of rows of padding data added to the upper boundary of the input feature data (pad_t).
The number of rows of padding data added to the lower boundary of the input feature data (pad_b).
The number of columns of padding data added to the left boundary of the input feature data (pad_l).
The number of columns of padding data added to the right boundary of the input feature data (pad_r).
The above pad_t, pad_b, pad_l, and pad_r are all referred to as padding parameters. Depending on actual needs, the auxiliary loading information may include at least one of the above items and may also include other items, which are not detailed here. In some embodiments, the cached input feature data may further be rearranged according to the auxiliary loading information so that the way the data is cached matches the way it is loaded, thereby improving loading efficiency. For example, through rearrangement, the input feature data destined for different rows of processing units can be cached at different cache addresses of the cache subunit, or the data required for loading can be filtered out of the cached input feature data as needed.
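How stride_h, dilate_h, and pad_t together select input rows can be sketched as follows (an illustrative helper with an assumed name; indices are 0-based, and negative indices denote top-padding rows):

```python
def input_rows_for_output(out_row, kernel_h, stride_h, dilate_h, pad_t=0):
    """Indices of the input-feature rows combined with the kernel to
    produce one output row: the kernel's top edge sits at
    out_row * stride_h (shifted up by pad_t), and successive kernel
    rows hit input rows dilate_h apart (dilate_h == 1 means adjacent)."""
    first = out_row * stride_h - pad_t
    return [first + k * dilate_h for k in range(kernel_h)]

# 3-row kernel, stride 1: output row 0 uses input rows 0, 1, 2;
# with dilate_h = 2 it instead uses alternate rows 0, 2, 4.
```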
In some embodiments, the cache subunit includes: a first cache subunit configured to cache the auxiliary loading information; and a second cache subunit configured to read the auxiliary loading information from the first cache subunit and cache, according to it, the input feature data returned in response to the read instruction. The first cache subunit may be a first-in-first-out (FIFO) queue.
In some embodiments, the second cache subunit includes: a third cache subunit configured to cache the input feature data returned by the storage unit in response to the read instruction; a read-write subunit configured to read the auxiliary loading information from the first cache subunit, rearrange the input feature data cached in the third cache subunit according to the auxiliary loading information, and write the rearranged input feature data into a fourth cache subunit; and the fourth cache subunit, configured to cache the rearranged input feature data so that the loading subunit can load it into the systolic processing array. The third cache subunit may be a FIFO queue, and the fourth cache subunit may be a random access memory (RAM).
In some embodiments, the read-write subunit is further configured to perform second padding on the input feature data while rearranging the input feature data cached in the third cache subunit. The second padding includes padding at least one of the upper boundary and the lower boundary of the input feature data.
FIG. 8 is a schematic structural diagram of the first loading unit according to an embodiment of the present disclosure. The first parsing subunit 802 (denoted IFM_SBLK_INFO in the figure) receives an instruction, parses it according to the systolic processing array's requirements for input feature data, and generates the description information for the second parsing subunit 804 (denoted IFM_SBLK_RD in the figure).
The second parsing subunit 804 parses the description information and sends an instruction to the storage unit to read the input feature data; the data read back is written into the third cache subunit 806 (denoted IFM_FIFO in the figure). During parsing, it also generates auxiliary loading information for the first cache subunit 814 (denoted INFO_FIFO in the figure).
The read-write subunit 808 (RD_FIFO_WR_RAM) reads data from the third cache subunit 806 according to the auxiliary loading information generated by the second parsing subunit 804 and loads it into the fourth cache subunit 812 (denoted IFM_RAM in the figure). Upper-boundary and lower-boundary padding are completed during this process.
The loading subunit 810 (denoted WR_IFM_2MAC in the figure) reads data from the fourth cache subunit 812 according to the systolic processing array's requirements for input feature data and sends it to the systolic processing array in turn. Left-boundary and right-boundary padding are completed during this process.
The third cache subunit 806 stores the input feature data read out of the storage unit. That is, the third cache subunit 806 caches the input feature data sent by the storage unit, and the data cached in it is then loaded into the fourth cache subunit 812. The width of the third cache subunit 806 matches the bit width of the storage unit's data port, that is, the storage unit's minimum access unit.
In some embodiments, there are multiple fourth cache subunits, which cache the rearranged input feature data in turn, and the loading subunit loads the input feature data from each of the fourth cache subunits into the systolic processing array in turn. When there are multiple cache subunits, each may have its own fourth cache subunit while the remaining functional units are shared. Each of the fourth cache subunits begins caching the rearranged input feature data once the previous fourth cache subunit has finished caching.
Further, each of the fourth cache subunits may include multiple fifth cache subunits. The height of each fifth cache subunit equals the height of the fourth cache subunit, and the width of each fifth cache subunit equals the bit width of the input feature data read from the storage unit.
In some embodiments, as shown in FIG. 9, there are three fourth cache subunits (the IFM_RAMs in the figure), referred to as cache subunit 902 (denoted ping_ram), cache subunit 904 (denoted pong_ram), and cache subunit 906 (denoted peng_ram). The fourth cache subunits store the input feature data according to the systolic processing array's requirements. The three identical ping-pong-peng IFM_RAM groups can work in a pipeline to speed up loading of the systolic processing array. The depth of each fourth cache subunit group equals the height h of one array block of the systolic processing array, and each group contains several small RAMs of depth h (for example, ping_ram0) whose width matches the bit width of the input feature data. The present disclosure uses 8 bits for illustration, but other bit widths are also applicable. Three groups of IFM_RAM need not be instantiated for every application scenario. For example, when performance requirements are low, a single group of IFM_RAM may be instantiated; when performance requirements are high, more groups may be instantiated, up to the point where adding IFM_RAM no longer improves performance.
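The benefit of pipelining multiple IFM_RAM groups can be estimated with a simple cycle model. This is a back-of-the-envelope sketch under stated assumptions (fill time does not exceed drain time, and the function name is illustrative), not a timing claim about the disclosed hardware:

```python
def total_cycles(num_blocks, fill_cycles, drain_cycles, pipelined):
    """Cycles to stream `num_blocks` data blocks through the array.

    Without pipelining, each block is fully buffered into an IFM_RAM and
    then fully drained before the next begins. With ping-pong(-peng)
    buffering, the next block is buffered while the current one drains,
    so only the first fill is exposed (assuming fill_cycles <= drain_cycles).
    """
    if not pipelined:
        return num_blocks * (fill_cycles + drain_cycles)
    return fill_cycles + num_blocks * drain_cycles
```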
The first loading units can be synchronized by a synchronization signal out_sync, which indicates whether the corresponding cache subunit 902, cache subunit 904, or cache subunit 906 in each first loading unit has finished loading data and is ready to output to the systolic processing array. Only when cache subunit 902, cache subunit 904, or cache subunit 906 in every first loading unit has been loaded does the selection unit 910 select the corresponding data from among them and output this round of data to the systolic processing array. In one embodiment, each of the plurality of first loading units can read multiple data blocks of the input feature data from the storage unit in turn, and once at least one of those data blocks has been fully cached, it can load that data block into an array block of the systolic processing array.
FIG. 10 is a processing flowchart of the data processing apparatus according to an embodiment of the present disclosure. It should be noted that this flow is only a schematic illustration of one possible processing manner of the data processing apparatus of the present disclosure and is not intended to limit the present disclosure.
Step 1002: Receive an instruction, which may carry at least one item of the description information above.
Step 1004: Parse the instruction to obtain the description information.
Step 1006: Read row data in units of the storage unit's minimum access unit. That is, starting from the row of the first input feature map ifm_start that corresponds to the kernel-height starting position kernel_h_start, access one minimum access unit per row until the kernel_h rows corresponding to one convolution kernel have been accessed, then continue with the kernel_h rows of the next group of input feature data corresponding to the next convolution kernel, again one minimum access unit per row. Here kernel_h is the convolution kernel height.
Step 1008: Following step 1006, complete the row accesses for the ifm_num_perb input feature data that one array block maps at a time, each row access corresponding to one minimum access unit. In one embodiment, ifm_num_perb is the number of input feature maps corresponding to one mapping of the array block.
Step 1010: Repeat steps 1006 and 1008, accessing one minimum access unit per row, until the accesses cover the entire row. This completes one loading of input feature data for the kernel_h rows of one input feature map, with kernel sliding stride stride_h and dilated-kernel height dilate_h. The convolution kernel parameters must be moved over the input feature data a total of ofm_h times, where ofm_h is the output feature map height.
Step 1012: Repeat steps 1006 to 1010 to complete the ofm_h moving scans of the convolution kernel parameters over the input feature data. At this point, the loading of all input feature data corresponding to one ifm_num_perb is complete.
Step 1014: Repeat steps 1006 to 1012, in units of ifm_num_perb (the number of input feature maps one array block maps at a time), until all input feature maps ifm have been loaded.
Step 1016: End.
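The loop nesting implied by steps 1006 to 1014 can be sketched as follows (names and the access-count model are illustrative assumptions; each innermost iteration stands for one minimum-access-unit read of one row):

```python
def count_row_accesses(ifm_num, ifm_num_perb, kernel_h, ofm_h, accesses_per_row):
    """Number of minimum-access-unit reads issued by the flow of
    FIG. 10, assuming ifm_num is a multiple of ifm_num_perb and each
    row needs `accesses_per_row` reads to be covered."""
    count = 0
    for _ in range(0, ifm_num, ifm_num_perb):      # step 1014: all maps, perb at a time
        for _ in range(ofm_h):                     # step 1012: ofm_h kernel positions
            for _ in range(accesses_per_row):      # step 1010: cover the whole row
                for _ in range(ifm_num_perb):      # step 1008: maps mapped at once
                    for _ in range(kernel_h):      # step 1006: kernel_h rows
                        count += 1
    return count
```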
In some embodiments, the number of columns of the systolic processing array is an integer multiple of 3. Because the minimum access unit of storage units such as SRAM is a power-of-two number of bytes, the dimensions of most systolic processing arrays in the industry are also powers of two. In current convolutional neural networks, however, 3 × 3 kernels account for a large proportion of kernels of all sizes, and when a 3 × 3 kernel is mapped onto a power-of-two-dimension systolic processing array, some processing units cannot be mapped to, wasting systolic processing array resources. Designing the number of columns of the systolic processing array as an integer multiple of 3 effectively reduces this waste in most cases, thereby improving the processing efficiency of the systolic processing array.
举例来说,在2的n次方维度的脉动处理阵列中,当n=5时,脉动处理阵列的尺寸就是32×32(即脉动处理阵列的每行和每列均有32个处理单元),这也是业界常见的1K规模CNN加速器的脉动处理阵列大小,但对于3×3的卷积核,这样的脉动处理阵列,总有两行和两列无法用到,造成了宝贵的脉动处理阵列计算资源的浪费。如图11A所示,每个小方块表示一个处理单元1106,多个第一装载单元1102分别装载输入特征图IFM0至输入特征图IFM9,多个第二装载单元1104分别装载权重Weight 0至权重Weight 9。其中,多个第一装载单元1102在不同时刻将各自对应的输入特征图传送至对应的多个处理单元106。所述对应的多个处理单元位于一个阵列块中。也就是说,多个第一装载单元1102在不同时刻将各自对应的输入特征图传送至对应阵列块。例如,在t0时刻,装载有输入特征图IFM0的第一装载单元1102将输入特征图IFM0传送至与该第一装载单元1102相邻的n行处理单元1106,其中,n为行数。优选地,n与卷积核的尺寸相对应。在一个实施方式中,当卷积核的尺寸为3×3的话,n等于3。也就是说,n等于卷积核的宽度或者长度。需要说明的是,位置上相邻的两个第一装载单元1102将各自对应的特征图传送至对应的处理单元1106的时刻间隔一个时钟周期。另外,多个第二装载单元1104在不同时刻将各自对应的权重传送至处理 单元1106。例如,在t0时刻,装载有权重Weight 0的第二装载单元1104将权重Weight 0传送至与该第二装载单元1104相邻的n行处理单元1106,其中,n为行数。优选地,n与卷积核的尺寸相对应。在一个实施方式中,当卷积核的尺寸为3×3的话,n等于3。也就是说,n等于卷积核的宽度或者长度。需要说明的是,位置上相邻的两个第一装载单元1104将各自对应的特征图传送至对应的处理单元1106的时刻间隔一个时钟周期。如果卷积核的尺寸为3×3的话(即,kij(1≤i≤3,1≤j≤3)),上述脉动处理阵列中的每三行处理单元1106最多能装载10组卷积核参数。然而,该脉动处理阵列的最后两行和最后两列则无法用于装载上述3×3的卷积核,因此这部分处理单元1106空闲,用“0”所示,并且对应的第一装载单元和第二装载单元的状态也为空闲。因此,理想情况下,脉动处理阵列的利用率是30/32=93.75%。For example, in the systolic processing array of 2 to the nth power, when n=5, the size of the systolic processing array is 32×32 (that is, each row and each column of the systolic processing array has 32 processing units) , which is also the size of the pulsation processing array of the common 1K scale CNN accelerator in the industry, but for a 3×3 convolution kernel, there are always two rows and two columns of such a pulsation processing array that cannot be used, resulting in a valuable pulsation processing array. Waste of computing resources. As shown in FIG. 11A , each small square represents a processing unit 1106, a plurality of first loading units 1102 respectively load input feature maps IFM0 to input feature maps IFM9, and a plurality of second loading units 1104 respectively load weights Weight 0 to weight Weight 9. 
The plurality of first loading units 1102 transmit their respective input feature maps to the corresponding plurality of processing units 1106 at different times. The corresponding plurality of processing units are located in one array block. That is, the plurality of first loading units 1102 transmit their respective input feature maps to corresponding array blocks at different times. For example, at time t0, the first loading unit 1102 loaded with input feature map IFM0 transmits IFM0 to the n rows of processing units 1106 adjacent to that first loading unit 1102, where n is a number of rows. Preferably, n corresponds to the size of the convolution kernel. In one embodiment, when the size of the convolution kernel is 3×3, n is equal to 3. That is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two positionally adjacent first loading units 1102 transmit their respective feature maps to the corresponding processing units 1106 are separated by one clock cycle. In addition, the plurality of second loading units 1104 transmit their respective weights to the processing units 1106 at different times. For example, at time t0, the second loading unit 1104 loaded with weight Weight 0 transmits Weight 0 to the n rows of processing units 1106 adjacent to that second loading unit 1104, where n is a number of rows. Preferably, n corresponds to the size of the convolution kernel. In one embodiment, when the size of the convolution kernel is 3×3, n is equal to 3. That is, n is equal to the width or length of the convolution kernel. It should be noted that the times at which two positionally adjacent second loading units 1104 transmit their respective weights to the corresponding processing units 1106 are separated by one clock cycle.
If the size of the convolution kernel is 3×3 (that is, kij (1≤i≤3, 1≤j≤3)), every three rows of processing units 1106 in the above systolic processing array can load at most 10 groups of convolution kernel parameters. However, the last two rows and the last two columns of the systolic processing array cannot be used to load the above 3×3 convolution kernels, so these processing units 1106 are idle, indicated by "0", and the corresponding first loading units and second loading units are also idle. Therefore, ideally, the utilization of the systolic processing array is 30/32 = 93.75%.
如果按照3的整数倍的维度来设计脉动处理阵列的尺寸，可以将上述例子中的脉动处理阵列尺寸由32×32优化为33×33，这样3×3的卷积核就能把33×33个处理单元全部利用。请参见图11B，图11B是根据本公开实施例的脉动处理阵列的长度的示意图。需要说明的是，图11B中与图11A相同的元件具有与图11A中元件相同或相似的功能。如图11B所示，仍以图中的每个小方块表示一个处理单元1106，则上述脉动处理阵列中的每组3×3的处理单元1106均可以用于装载一个3×3的卷积核，每个处理单元1106中均可装载一个卷积核参数kij，不存在空闲的处理单元1106。因此，理想情况下，脉动处理阵列的利用率是100%。If the size of the systolic processing array is designed with dimensions that are integer multiples of 3, the array size in the above example can be optimized from 32×32 to 33×33, so that 3×3 convolution kernels can utilize all 33×33 processing units. Please refer to FIG. 11B, which is a schematic diagram of the length of a systolic processing array according to an embodiment of the present disclosure. It should be noted that elements in FIG. 11B that are the same as those in FIG. 11A have the same or similar functions as the elements in FIG. 11A. As shown in FIG. 11B, with each small square in the figure again representing a processing unit 1106, each 3×3 group of processing units 1106 in the above systolic processing array can be used to load one 3×3 convolution kernel, each processing unit 1106 can load one convolution kernel parameter kij, and no processing unit 1106 is idle. Therefore, ideally, the utilization of the systolic processing array is 100%.
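The utilization figures above (30/32 = 93.75% for a 32×32 array and 100% for a 33×33 array with 3×3 kernels) can be checked with a short sketch. This is illustrative only and not part of the claimed apparatus; the function name and the per-dimension accounting used in the text are assumptions.

```python
def dimension_utilization(num_cols: int, k: int = 3) -> float:
    """Fraction of columns (or rows) usable when tiling k-wide
    convolution kernels onto a systolic array with num_cols columns,
    using the per-dimension accounting from the text above."""
    usable = (num_cols // k) * k  # columns covered by whole kernels
    return usable / num_cols

# 32x32 array, 3x3 kernels: the last two columns are unusable.
assert dimension_utilization(32) == 30 / 32   # 93.75%
# 33x33 array (columns a multiple of 3): no column is wasted.
assert dimension_utilization(33) == 1.0       # 100%
```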
上述数据处理装置可以是处理芯片，例如，SoC芯片，也可以是计算机设备。图12示出了本说明书实施例所提供的一种更为具体的数据处理装置的硬件结构示意图，该设备可以包括：处理器1202、存储器1204、输入/输出接口1206、通信接口1208和总线1210。其中处理器1202、存储器1204、输入/输出接口1206和通信接口1208通过总线1210实现彼此之间在设备内部的通信连接。The above data processing apparatus may be a processing chip, for example, an SoC chip, or may be a computer device. FIG. 12 shows a schematic diagram of the hardware structure of a more specific data processing apparatus provided by an embodiment of this specification. The apparatus may include: a processor 1202, a memory 1204, an input/output interface 1206, a communication interface 1208, and a bus 1210. The processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208 are communicatively connected to one another within the device through the bus 1210.
处理器1202可以采用通用的CPU（Central Processing Unit，中央处理器）、微处理器、应用专用集成电路（Application Specific Integrated Circuit，ASIC）、或者一个或多个集成电路等方式实现，用于执行相关程序，以实现本说明书实施例所提供的技术方案。The processor 1202 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of this specification.
存储器1204可以采用ROM（Read Only Memory，只读存储器）、RAM（Random Access Memory，随机存取存储器）、静态存储设备，动态存储设备等形式实现。存储器1204可以存储操作系统和其他应用程序，在通过软件或者固件来实现本说明书实施例所提供的技术方案时，相关的程序代码保存在存储器1204中，并由处理器1202来调用执行。The memory 1204 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, and the like. The memory 1204 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1204 and invoked and executed by the processor 1202.
输入/输出接口1206用于连接输入/输出模块，以实现信息输入及输出。输入/输出模块可以作为组件配置在设备中（图中未示出），也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等，输出设备可以包括显示器、扬声器、振动器、指示灯等。The input/output interface 1206 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown in the figure), or may be externally connected to the device to provide corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like, and output devices may include a display, a speaker, a vibrator, indicator lights, and the like.
通信接口1208用于连接通信模块(图中未示出),以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。The communication interface 1208 is used to connect a communication module (not shown in the figure), so as to realize the communication interaction between the device and other devices. The communication module may implement communication through wired means (eg, USB, network cable, etc.), or may implement communication through wireless means (eg, mobile network, WIFI, Bluetooth, etc.).
总线1210包括一通路,在设备的各个组件(例如处理器1202、存储器1204、输入/输出接口1206和通信接口1208)之间传输信息。The bus 1210 includes a path that transfers information between the various components of the device (eg, the processor 1202, the memory 1204, the input/output interface 1206, and the communication interface 1208).
需要说明的是，尽管上述设备仅示出了处理器1202、存储器1204、输入/输出接口1206、通信接口1208以及总线1210，但是在具体实施过程中，该设备还可包括实现正常运行所必需的其他组件。此外，本领域的技术人员可以理解，上述设备中也可以仅包含实现本说明书实施例方案所必需的组件，不必包含图中所示的全部组件。It should be noted that, although the above device shows only the processor 1202, the memory 1204, the input/output interface 1206, the communication interface 1208, and the bus 1210, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figures.
如图14所示，本公开实施例还提供一种数据处理系统1400，可包括上述任一实施例的数据处理装置（例如，数据处理装置1406），以及脉动处理阵列1408，用于装载所述输入特征数据和卷积核参数，并对所述输入特征数据和所述卷积核参数进行卷积处理。As shown in FIG. 14, an embodiment of the present disclosure further provides a data processing system 1400, which may include the data processing apparatus of any of the above embodiments (e.g., a data processing apparatus 1406) and a systolic processing array 1408 configured to load the input feature data and the convolution kernel parameters, and to perform convolution processing on the input feature data and the convolution kernel parameters.
在一些实施例中,所述系统还包括:第二装载单元1404,用于将所述卷积核参数装载到所述脉动处理阵列中。In some embodiments, the system further includes: a second loading unit 1404 for loading the convolution kernel parameters into the systolic processing array.
在一些实施例中,所述系统还包括:存储单元,用于存储所述输入特征数据。In some embodiments, the system further includes: a storage unit for storing the input feature data.
在一些实施例中，如图13所示，所述存储单元1302包括：多个相互独立的存储子单元，每个存储子单元用于存储所述输入特征数据中的部分数据；所述多个第一装载单元用于在同一时刻访问不同的存储子单元，以获取对应存储子单元存储的输入特征数据。In some embodiments, as shown in FIG. 13, the storage unit 1302 includes a plurality of mutually independent storage subunits, each storage subunit being used to store part of the input feature data; the plurality of first loading units are configured to access different storage subunits at the same time to acquire the input feature data stored in the corresponding storage subunits.
进一步地，所述存储单元1302还包括调度单元1304，用于接收所述多个第一装载单元的访问请求，并将所述访问请求发送至对应的存储子单元1306，以使所述多个第一装载单元访问对应的存储子单元1306。Further, the storage unit 1302 further includes a scheduling unit 1304 configured to receive access requests from the plurality of first loading units and to send the access requests to the corresponding storage subunits 1306, so that the plurality of first loading units access the corresponding storage subunits 1306.
在一些实施例中，所述系统还包括：输出单元，用于获取所述脉动处理阵列输出的处理结果，并对所述处理结果进行存储，或者输出所述处理结果。In some embodiments, the system further includes: an output unit configured to acquire the processing result output by the systolic processing array, and to store or output the processing result.
在一些实施例中，所述系统可以基于现场可编程逻辑门阵列（Field Programmable Gate Array，FPGA）或者专用集成电路（Application Specific Integrated Circuit，ASIC）实现。In some embodiments, the system may be implemented based on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
如图15A和图15B所示，本公开实施例还提供一种神经网络加速器1500，其特征在于，所述神经网络加速器包括本公开任一实施例所述的数据处理装置1502，或者包括本公开任一实施例所述的数据处理系统1504。As shown in FIG. 15A and FIG. 15B, an embodiment of the present disclosure further provides a neural network accelerator 1500, where the neural network accelerator includes the data processing apparatus 1502 described in any embodiment of the present disclosure, or includes the data processing system 1504 described in any embodiment of the present disclosure.
在一些实施例中,所述神经网络加速器为CNN加速器或者循环神经网络(Recurrent Neural Network,RNN)加速器。In some embodiments, the neural network accelerator is a CNN accelerator or a Recurrent Neural Network (RNN) accelerator.
如图16所示，本公开还提供一种数据处理方法，应用于包括多个第一装载单元的数据处理装置中的每个第一装载单元，以将输入特征数据装载到脉动处理阵列中，所述方法包括：As shown in FIG. 16, the present disclosure further provides a data processing method applied to each first loading unit in a data processing apparatus including a plurality of first loading units, so as to load input feature data into a systolic processing array. The method includes:
步骤1602:通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元,从所述存储单元读取输入特征数据;Step 1602: Access a storage unit in parallel through each of the plurality of first loading units, and read input feature data from the storage unit;
步骤1604:对读取的所述输入特征数据进行缓存;以及Step 1604: Cache the read input feature data; and
步骤1606:将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元。Step 1606: Load the buffered input feature data into at least one row of processing units in the systolic processing array.
在一些实施例中,所述多个第一装载单元对所述输入特征数据中的有效数据的缓存速率之和大于或等于所述脉动处理阵列的装载速率。In some embodiments, the sum of the buffering rates of valid data in the input feature data by the plurality of first loading units is greater than or equal to the loading rate of the systolic processing array.
在一些实施例中，每个时钟周期内所述输入特征数据的缓存速率之和均大于或等于所述脉动处理阵列在该时钟周期内的装载速率；或者所述输入特征数据的平均缓存速率之和大于或等于所述脉动处理阵列的平均装载速率；或者所述输入特征数据的平均缓存速率之和大于或等于所述脉动处理阵列的最大装载速率。In some embodiments, the sum of the buffering rates of the input feature data in each clock cycle is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or the sum of the average buffering rates of the input feature data is greater than or equal to the average loading rate of the systolic processing array; or the sum of the average buffering rates of the input feature data is greater than or equal to the maximum loading rate of the systolic processing array.
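The rate condition stated above (the combined caching rate of the first loading units must meet or exceed the array's loading rate so the systolic array is never starved) can be sketched as a simple check. Function names and units (elements per clock cycle) are illustrative assumptions, not terms from the specification.

```python
def cache_rates_sufficient(cache_rates, load_rate):
    """True if the summed caching rate of the first loading units
    meets or exceeds the systolic array's loading rate."""
    return sum(cache_rates) >= load_rate

# Four loading units each caching 8 elements/cycle can feed an array
# that consumes 32 elements/cycle; three such units cannot.
assert cache_rates_sufficient([8, 8, 8, 8], 32)
assert not cache_rates_sufficient([8, 8, 8], 32)
```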
在一些实施例中,所述方法还包括:在对所述输入特征数据进行装载的过程中,对所述输入特征数据进行第一填充。In some embodiments, the method further includes: in the process of loading the input feature data, performing a first filling on the input feature data.
在一些实施例中,所述对所述输入特征数据进行第一填充,包括:对所述输入特征数据的左边界和右边界中的至少一者进行填充。In some embodiments, the performing the first padding on the input feature data includes: padding at least one of a left border and a right border of the input feature data.
在一些实施例中，所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取输入特征数据，包括：每次从所述存储单元读取所述输入特征数据中的一个数据块；所述将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元，包括：在对所述数据块缓存完毕的情况下，将所述数据块装载到所述脉动处理阵列中的至少一行处理单元中。In some embodiments, accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: reading one data block of the input feature data from the storage unit each time; and loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading the data block into at least one row of processing units in the systolic processing array after the data block has been completely cached.
在一些实施例中,所述数据块的列数或行数等于所述存储单元的最小访问单位对应的数据个数。In some embodiments, the number of columns or rows of the data block is equal to the number of data corresponding to the minimum access unit of the storage unit.
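The relationship above between the data block's column (or row) count and the storage unit's minimum access unit can be illustrated as follows; the byte sizes are hypothetical examples, not values from the specification.

```python
def block_width_in_elements(min_access_bytes: int, bytes_per_element: int) -> int:
    """Number of feature-data elements per data block when the block's
    column count matches the storage unit's minimum access unit."""
    return min_access_bytes // bytes_per_element

assert block_width_in_elements(32, 1) == 32  # 32-byte access unit, int8 features
assert block_width_in_elements(32, 2) == 16  # 16-bit features
```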
在一些实施例中，所述多个第一装载单元中的每个第一装载单元包括发送子单元，缓存子单元和装载子单元；所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取输入特征数据，包括：通过所述发送子单元，向所述存储单元发送读取指令；通过所述缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存；以及通过所述装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中。In some embodiments, each of the plurality of first loading units includes a sending subunit, a cache subunit, and a loading subunit; and accessing the storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: sending a read instruction to the storage unit through the sending subunit; caching, through the cache subunit, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array.
在一些实施例中,所述缓存子单元的高度等于所述脉动处理阵列的高度。In some embodiments, the height of the cache subunit is equal to the height of the systolic processing array.
在一些实施例中,在所述脉动处理阵列包括多个阵列块的情况下,每个缓存子单元对应一个阵列块,且所述缓存子单元的高度等于对应阵列块的高度。In some embodiments, when the systolic processing array includes a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to the height of the corresponding array block.
在一些实施例中，所述缓存子单元的数量为多个；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：通过多个缓存子单元，依次对所述根据所述读取指令返回的输入特征数据进行缓存；所述通过装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中，包括：依次将所述多个缓存子单元中的每个缓存子单元中的输入特征数据装载到所述脉动处理阵列的所述至少一行处理单元中。In some embodiments, there are a plurality of the cache subunits; caching, through the cache subunits, the input feature data returned according to the read instruction includes: sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array includes: sequentially loading the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
在一些实施例中，所述通过所述多个缓存子单元，依次对所述根据所述读取指令返回的输入特征数据进行缓存，包括：在所述多个缓存子单元中的前一个缓存子单元对应的输入特征数据缓存完成的情况下，开始对所述多个缓存子单元中的当前缓存子单元对应的输入特征数据进行缓存。In some embodiments, sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction includes: starting to cache the input feature data corresponding to the current cache subunit of the plurality of cache subunits once the caching of the input feature data corresponding to the previous cache subunit of the plurality of cache subunits has been completed.
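The alternating use of cache subunits described above (the next subunit begins caching only after the previous one has finished, while draining into the array proceeds subunit by subunit in the same order) is essentially double buffering. A minimal sketch, with hypothetical class and method names:

```python
from collections import deque

class AlternatingCache:
    """Ping-pong style caching: items are written into one subunit
    until it is full, then the next subunit starts filling; draining
    proceeds subunit by subunit in the same order."""
    def __init__(self, num_subunits: int, depth: int):
        self.bufs = [deque(maxlen=depth) for _ in range(num_subunits)]
        self.fill = 0   # index of the subunit currently being filled
        self.drain = 0  # index of the subunit currently being drained

    def cache(self, item):
        buf = self.bufs[self.fill]
        buf.append(item)
        if len(buf) == buf.maxlen:  # current subunit finished caching
            self.fill = (self.fill + 1) % len(self.bufs)

    def load(self):
        buf = self.bufs[self.drain]
        item = buf.popleft()
        if not buf:  # subunit drained; move on to the next one
            self.drain = (self.drain + 1) % len(self.bufs)
        return item

c = AlternatingCache(num_subunits=2, depth=2)
for x in (10, 11, 12, 13):
    c.cache(x)
assert [c.load() for _ in range(4)] == [10, 11, 12, 13]
```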
在一些实施例中，所述缓存子单元包括多个缓存块；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：通过所述多个缓存块中的每个缓存块缓存所述脉动处理阵列中的一行处理单元所需的输入特征数据；所述通过所述装载子单元，将缓存的输入特征数据装载到所述脉动处理阵列中的至少一行处理单元中，包括：通过所述装载子单元，将所述多个缓存块中的每个缓存块缓存的输入特征数据装载到所述脉动处理阵列中对应的一行处理单元；其中，在第v个缓存块缓存的输入特征数据装载完成之后，向第v+1个缓存块发送第一装载指令，以使所述装载子单元开始对所述第v+1个缓存块缓存的输入特征数据进行装载；其中，v为大于1的整数。In some embodiments, the cache subunit includes a plurality of cache blocks; caching, through the cache subunit, the input feature data returned according to the read instruction includes: caching, in each of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array includes: loading, through the loading subunit, the input feature data cached in each of the plurality of cache blocks into a corresponding row of processing units in the systolic processing array; where, after the loading of the input feature data cached in the v-th cache block is completed, a first loading instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
在一些实施例中，所述通过所述多个第一装载单元中的每个第一装载单元并行地访问存储单元，从所述存储单元读取所述输入特征数据，包括：在所述多个第一装载单元中的每个第一装载单元对应的输入特征数据装载完成之后，向下一个第一装载单元发送第二装载指令，以触发下一个第一装载单元向所述脉动处理阵列装载所述输入特征数据。In some embodiments, accessing a storage unit in parallel through each of the plurality of first loading units and reading the input feature data from the storage unit includes: after the loading of the input feature data corresponding to each of the plurality of first loading units is completed, sending a second loading instruction to the next first loading unit to trigger the next first loading unit to load the input feature data into the systolic processing array.
在一些实施例中，所述发送子单元包括第一解析子单元和第二解析子单元；以及所述通过发送子单元，向所述存储单元发送读取指令，包括：通过第一解析子单元，接收装载指令，对所述装载指令进行解析，以生成待装载的输入特征数据的描述信息；以及通过第二解析子单元，对所述描述信息进行解析，根据解析结果向所述存储单元发送所述读取指令。In some embodiments, the sending subunit includes a first parsing subunit and a second parsing subunit; and sending the read instruction to the storage unit through the sending subunit includes: receiving, through the first parsing subunit, a loading instruction and parsing the loading instruction to generate description information of the input feature data to be loaded; and parsing the description information through the second parsing subunit and sending the read instruction to the storage unit according to the parsing result.
在一些实施例中，所述描述信息包括以下至少任一：待装载的输入特征数据的组数，待装载的第一组输入特征数据的编号，对应阵列块能够同时处理的输入特征数据的组数，所述脉动处理阵列同时处理的输入特征数据的组数，卷积核的高度，对应阵列块的起始位置，待装载的输入特征数据的基地址，待装载的每组输入特征数据所占用存储空间的大小，以及待装载的输入特征数据的宽度。In some embodiments, the description information includes at least any one of the following: the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data that the systolic processing array processes simultaneously, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
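The description-information fields listed above can be grouped, for illustration, into a single descriptor. All field names and the address helper below are hypothetical, and contiguous group storage is an assumption; none of this is taken from the specification.

```python
from dataclasses import dataclass

@dataclass
class LoadDescriptor:
    """Illustrative container for the description information above."""
    num_groups: int        # number of groups of input feature data to load
    first_group_id: int    # number of the first group to load
    groups_per_block: int  # groups the corresponding array block handles at once
    groups_per_array: int  # groups the whole systolic array handles at once
    kernel_height: int     # height of the convolution kernel
    block_start: int       # starting position of the corresponding array block
    base_address: int      # base address of the input feature data to load
    group_size: int        # storage space occupied by each group
    feature_width: int     # width of the input feature data to load

    def group_address(self, i: int) -> int:
        """Address of the i-th group, assuming groups are contiguous."""
        return self.base_address + i * self.group_size

d = LoadDescriptor(10, 0, 1, 10, 3, 0, 0x1000, 0x200, 224)
assert d.group_address(2) == 0x1000 + 2 * 0x200
```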
在一些实施例中，所述缓存子单元包括第一缓存子单元和第二缓存子单元；所述通过缓存子单元，对根据所述读取指令返回的输入特征数据进行缓存，包括：生成辅助装载信息，并且通过所述辅助装载信息，确定所述待装载的输入特征数据的装载方式；通过第一缓存子单元，对所述辅助装载信息进行缓存；通过第二缓存子单元，从所述第一缓存子单元中读取所述辅助装载信息，并根据所述辅助装载信息对根据所述读取指令返回的输入特征数据进行缓存。In some embodiments, the cache subunit includes a first cache subunit and a second cache subunit; caching, through the cache subunit, the input feature data returned according to the read instruction includes: generating auxiliary loading information and determining, through the auxiliary loading information, a loading mode of the input feature data to be loaded; caching the auxiliary loading information through the first cache subunit; and reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching, according to the auxiliary loading information, the input feature data returned according to the read instruction.
在一些实施例中，所述通过第二缓存子单元，从所述第一缓存子单元中读取所述辅助装载信息，并根据所述辅助装载信息对所述输入特征数据进行缓存，进一步包括：通过第三缓存子单元，对所述存储单元根据所述读取指令返回的输入特征数据进行缓存；通过读写子单元，从所述第一缓存子单元中读取所述辅助装载信息，根据所述辅助装载信息对所述第三缓存子单元缓存的输入特征数据进行重排，并将重排后的输入特征数据写入所述第四缓存子单元；以及通过所述第四缓存子单元，对重排后的输入特征数据进行缓存，以供所述装载子单元将所述重排后的输入特征数据装载到所述脉动处理阵列中。In some embodiments, reading, through the second cache subunit, the auxiliary loading information from the first cache subunit and caching the input feature data according to the auxiliary loading information further includes: caching, through a third cache subunit, the input feature data returned by the storage unit according to the read instruction; reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging the input feature data cached in the third cache subunit according to the auxiliary loading information, and writing the rearranged input feature data into a fourth cache subunit; and caching the rearranged input feature data through the fourth cache subunit, so that the loading subunit loads the rearranged input feature data into the systolic processing array.
在一些实施例中，所述第四缓存子单元的数量为多个；所述通过所述第四缓存子单元，对重排后的输入特征数据进行缓存，包括：通过多个第四缓存子单元，依次对所述重排后的输入特征数据进行缓存；所述装载子单元将所述重排后的输入特征数据装载到所述脉动处理阵列中，包括：通过所述装载子单元，依次将所述多个第四缓存子单元中的每个第四缓存子单元中的输入特征数据装载到所述脉动处理阵列中。In some embodiments, there are a plurality of the fourth cache subunits; caching the rearranged input feature data through the fourth cache subunits includes: sequentially caching the rearranged input feature data through the plurality of fourth cache subunits; and loading, by the loading subunit, the rearranged input feature data into the systolic processing array includes: sequentially loading, through the loading subunit, the input feature data in each of the plurality of fourth cache subunits into the systolic processing array.
在一些实施例中，在所述多个第四缓存子单元中的前一个第四缓存子单元对应的重排后的输入特征数据缓存完成的情况下，开始对所述多个第四缓存子单元中的当前第四缓存子单元对应的重排后的输入特征数据进行缓存。In some embodiments, the caching of the rearranged input feature data corresponding to the current fourth cache subunit of the plurality of fourth cache subunits is started once the caching of the rearranged input feature data corresponding to the previous fourth cache subunit of the plurality of fourth cache subunits has been completed.
在一些实施例中，所述多个第四缓存子单元中的每个第四缓存子单元包括多个第五缓存子单元，所述第四缓存子单元包括的多个第五缓存子单元中每个第五缓存子单元的高度等于所述第四缓存子单元的高度，每个第五缓存子单元的宽度等于从所述存储单元中读取的输入特征数据的位宽。In some embodiments, each fourth cache subunit of the plurality of fourth cache subunits includes a plurality of fifth cache subunits; the height of each fifth cache subunit included in the fourth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
在一些实施例中，所述方法还包括：在对所述第三缓存子单元缓存的输入特征数据进行重排的过程中，对所述第三缓存子单元缓存的输入特征数据进行第二填充。In some embodiments, the method further includes: in the process of rearranging the input feature data cached in the third cache subunit, performing a second filling on the input feature data cached in the third cache subunit.
在一些实施例中，所述对所述第三缓存子单元缓存的输入特征数据进行第二填充，包括：对所述第三缓存子单元缓存的输入特征数据的上边界和下边界中的至少一者进行填充。In some embodiments, performing the second filling on the input feature data cached in the third cache subunit includes: padding at least one of the upper boundary and the lower boundary of the input feature data cached in the third cache subunit.
在一些实施例中，所述辅助装载信息包括以下至少任一：卷积核参数在输入特征数据的列方向上的滑动步长，空洞卷积核的高度以及填充参数。In some embodiments, the auxiliary loading information includes at least any one of the following: a sliding stride of the convolution kernel parameters in the column direction of the input feature data, the height of a dilated (atrous) convolution kernel, and a padding parameter.
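Of the auxiliary loading information above, the effective height of a dilated (atrous) convolution kernel follows a standard formula; the sketch below is illustrative only and is not taken from the specification.

```python
def dilated_kernel_height(k: int, dilation: int) -> int:
    """Effective height (rows of input spanned) of a k-tall
    convolution kernel with the given dilation rate."""
    return dilation * (k - 1) + 1

assert dilated_kernel_height(3, 1) == 3  # ordinary 3x3 kernel
assert dilated_kernel_height(3, 2) == 5  # dilation-2 kernel spans 5 rows
```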
在一些实施例中，所述脉动处理阵列包括多个阵列块，每个阵列块分别用于处理一组输入特征数据；所述将缓存的所述输入特征数据装载到脉动处理阵列中的至少一行处理单元，包括：通过所述多个第一装载单元中的每个第一装载单元分别向至少一个阵列块装载所述输入特征数据。In some embodiments, the systolic processing array includes a plurality of array blocks, each array block being used to process one group of input feature data; and loading the cached input feature data into at least one row of processing units in the systolic processing array includes: loading, through each of the plurality of first loading units, the input feature data into at least one array block.
在一些实施例中，所述多个第一装载单元中的每个第一装载单元包括缓存子单元，所述缓存子单元的深度与所述第一装载单元装载的阵列块的深度相等；所述对读取的所述输入特征数据进行缓存，包括：通过所述缓存子单元，对从所述存储单元读取的输入特征数据进行缓存。In some embodiments, each of the plurality of first loading units includes a cache subunit, the depth of the cache subunit being equal to the depth of the array block loaded by that first loading unit; and caching the read input feature data includes: caching, through the cache subunit, the input feature data read from the storage unit.
在一些实施例中,各个阵列块的高度均相等。In some embodiments, the heights of each array block are all equal.
在一些实施例中,一个阵列块的尺寸等于所述脉动处理阵列中所装载的一个卷积核参数的尺寸。In some embodiments, the size of an array block is equal to the size of a convolution kernel parameter loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的尺寸根据所述脉动处理阵列中所装载的卷积核参数的尺寸而确定。In some embodiments, the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的尺寸为所述脉动处理阵列中所装载的一个卷积核参数的尺寸的整数倍。In some embodiments, the size of the systolic processing array is an integer multiple of the size of a convolution kernel parameter loaded in the systolic processing array.
在一些实施例中,所述脉动处理阵列的列数为3的整数倍。In some embodiments, the number of columns of the systolic processing array is an integer multiple of three.
在一些实施例中，所述多个第一装载单元向所述脉动处理阵列的第u行处理单元装载数据的时刻比所述多个第一装载单元向所述脉动处理阵列的第u+1行处理单元装载数据的时刻早一个时钟周期，u为正整数。In some embodiments, the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, where u is a positive integer.
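The one-cycle skew between successive rows described above means row u receives its data exactly u-1 cycles after row 1. A trivial sketch of this schedule, with assumed names:

```python
def row_load_cycle(t0: int, u: int) -> int:
    """Clock cycle at which row u (1-indexed) of the systolic array is
    loaded, given that row 1 is loaded at cycle t0 and each subsequent
    row is loaded one clock cycle later than the row above it."""
    return t0 + (u - 1)

assert row_load_cycle(0, 1) == 0
assert row_load_cycle(0, 4) == 3  # row 4 loads three cycles after row 1
```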
上述方法实施例中第一装载单元的具体实施例与前述数据处理装置中第一装载单元602的实施例相同,此处不再赘述。The specific embodiment of the first loading unit in the above method embodiment is the same as the embodiment of the first loading unit 602 in the foregoing data processing apparatus, and details are not described herein again.
本公开实施例还包括一种数据处理装置，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述程序时实现任一实施例所述的方法中由任一第一装载单元所执行的步骤。上述数据处理装置可以是一个数据处理芯片，例如，系统级芯片（System on Chip，SoC）。Embodiments of the present disclosure further include a data processing apparatus including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps performed by any first loading unit in the method of any embodiment. The above data processing apparatus may be a data processing chip, for example, a system on chip (SoC).
本说明书实施例还提供一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现前述任一实施例所述的方法。The embodiments of the present specification further provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, implements the method described in any of the foregoing embodiments.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体，可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括，但不限于相变内存（PRAM）、静态随机存取存储器（SRAM）、动态随机存取存储器（DRAM）、其他类型的随机存取存储器（RAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、快闪记忆体或其他内存技术、只读光盘只读存储器（CD-ROM）、数字多功能光盘（DVD）或其他光学存储、磁盒式磁带，磁带磁盘存储或其他磁性存储设备或任何其他非传输介质，可用于存储可以被计算设备访问的信息。按照本文中的界定，计算机可读介质不包括暂存电脑可读媒体（transitory media），如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
通过以上的实施方式的描述可知，本领域的技术人员可以清楚地了解到本说明书实施例可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解，本说明书实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品可以存储在存储介质中，如ROM/RAM、磁碟、光盘等，包括若干指令用以使得一台计算机设备（可以是个人计算机，服务器，或者网络设备等）执行本说明书各个实施例或者实施例的某些部分所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments of this specification may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments of this specification or in certain parts of the embodiments.
上述实施例阐明的系统、装置、模块、单元或神经网络加速器，具体可以由计算机芯片或实体实现，或者由具有某种功能的产品来实现。一种典型的实现设备为计算机，计算机的具体形式可以是个人计算机、膝上型计算机、蜂窝电话、相机电话、智能电话、个人数字助理、媒体播放器、导航设备、电子邮件收发设备、游戏控制台、平板计算机、可穿戴设备或者这些设备中的任意几种设备的组合。The systems, apparatuses, modules, units, or neural network accelerators described in the above embodiments may be specifically implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, and the specific form of the computer may be a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
以上实施例中的各种技术特征可以任意进行组合，只要特征之间的组合不存在冲突或矛盾，但是限于篇幅，未进行一一描述，因此上述实施方式中的各种技术特征的任意进行组合也属于本公开的范围。The various technical features in the above embodiments may be combined arbitrarily, as long as there is no conflict or contradiction between the combined features; due to space limitations, these combinations are not described one by one, but any combination of the various technical features in the above embodiments also falls within the scope of the present disclosure.
本领域技术人员在考虑公开及实践这里公开的说明书后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the disclosure and practice of the specification disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common general knowledge or techniques in the technical field not disclosed by this disclosure . The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
应当理解的是,本公开不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本公开的范围仅由所附的权利要求限制。It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
以上所述仅为本公开的较佳实施例,并不用以限制本公开,凡在本公开的精神和原则之内所做的任何修改、等同替换、改进等,均应包含在本公开保护的范围之内。The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present disclosure shall be included in the protection of the present disclosure. within the range.

Claims (75)

  1. A data processing apparatus for loading input feature data into a systolic processing array, wherein the apparatus comprises:
    a plurality of first loading units, each of the plurality of first loading units being configured to:
    access a storage unit in parallel, to read input feature data from the storage unit;
    cache the read input feature data; and
    load the cached input feature data into at least one row of processing units in the systolic processing array.
  2. The data processing apparatus according to claim 1, wherein a sum of the rates at which the plurality of first loading units cache valid data in the input feature data is greater than or equal to a loading rate of the systolic processing array.
  3. The data processing apparatus according to claim 2, wherein, in each clock cycle, the sum of the caching rates of the input feature data is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or
    a sum of average caching rates of the input feature data is greater than or equal to an average loading rate of the systolic processing array; or
    the sum of the average caching rates of the input feature data is greater than or equal to a maximum loading rate of the systolic processing array.
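The rate condition of claims 2-3 can be illustrated with a minimal simulation sketch. All names (`array_never_starves`, the per-cycle rate lists) are hypothetical and chosen only for illustration; the sketch assumes one fixed consumption rate and integer data counts per clock cycle:

```python
# Hypothetical sketch of the buffering-rate condition in claims 2-3:
# the combined rate at which the first loading units cache valid data
# must not fall below the rate at which the systolic array consumes it,
# otherwise the array would stall waiting for input.

def array_never_starves(cache_rates_per_cycle, load_rate_per_cycle, cycles):
    """Simulate `cycles` clock ticks; return True if the buffered data
    never runs out while the systolic array loads at a fixed rate."""
    buffered = 0
    for _ in range(cycles):
        buffered += sum(cache_rates_per_cycle)   # all loaders cache in parallel
        if buffered < load_rate_per_cycle:       # array would underflow here
            return False
        buffered -= load_rate_per_cycle
    return True

# Four loaders each caching 2 values/cycle feed an array loading 8/cycle:
ok = array_never_starves([2, 2, 2, 2], 8, cycles=100)    # sum equals load rate
starved = array_never_starves([2, 2, 2], 8, cycles=100)  # sum below load rate
```

With the sum of caching rates equal to the loading rate the array is kept fed; with the sum below it, the simulated array underflows.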
  4. The data processing apparatus according to claim 1, wherein the plurality of first loading units are further configured to:
    perform first padding on the input feature data in the process of loading the input feature data.
  5. The data processing apparatus according to claim 4, wherein the first padding comprises padding at least one of a left boundary and a right boundary of the input feature data.
  6. The data processing apparatus according to claim 1, wherein the first loading unit is capable of reading one data block of the input feature data from the storage unit at a time, and, when caching of the data block is completed, is capable of loading the data block into at least one row of processing units in the systolic processing array.
  7. The data processing apparatus according to claim 6, wherein the number of columns or rows of the data block is equal to the number of data elements corresponding to a minimum access unit of the storage unit.
  8. The data processing apparatus according to claim 1, wherein the first loading unit comprises:
    a sending subunit, configured to send a read instruction to the storage unit;
    a cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction; and
    a loading subunit, configured to load the cached input feature data into at least one row of processing units in the systolic processing array.
  9. The data processing apparatus according to claim 8, wherein a height of the cache subunit is equal to a height of the systolic processing array.
  10. The data processing apparatus according to claim 9, wherein, in a case in which the systolic processing array comprises a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to a height of the corresponding array block.
  11. The data processing apparatus according to claim 8, wherein there are a plurality of cache subunits, and the plurality of cache subunits are capable of sequentially caching the input feature data returned according to the read instruction;
    the loading subunit is configured to sequentially load the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
  12. The data processing apparatus according to claim 11, wherein each of the plurality of cache subunits is capable of starting to cache the input feature data corresponding to the current cache subunit once caching of the input feature data corresponding to the previous cache subunit is completed.
  13. The data processing apparatus according to claim 8, wherein the cache subunit comprises a plurality of cache blocks, each of the plurality of cache blocks being configured to cache the input feature data required by one row of processing units in the systolic processing array, and the loading subunit is configured to load the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array;
    wherein, after loading of the input feature data cached in the v-th cache block is completed, a first loading instruction can be sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
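The overlap of caching and loading across multiple cache subunits (claims 11-12) resembles a ping-pong (double-buffering) scheme. The sketch below is illustrative only: the list-based "caches", the function name, and the block labels are assumptions, not the claimed hardware:

```python
# Illustrative ping-pong buffering in the spirit of claims 11-12: while
# one cache subunit is drained into the systolic array, the next subunit
# caches the data returned for the following read, so caching and
# loading overlap instead of alternating.

def pingpong_load(data_blocks, num_caches=2):
    """Cache each incoming block in the next subunit in round-robin order;
    return the per-subunit history and the order blocks reach the array."""
    caches = [[] for _ in range(num_caches)]
    loaded = []
    for i, block in enumerate(data_blocks):
        c = i % num_caches        # the next subunit starts caching as soon
        caches[c].append(block)   # as the previous one has finished
        loaded.append(block)      # drain subunit c into the array
    return caches, loaded

caches, loaded = pingpong_load(["b0", "b1", "b2", "b3"])
```

With two subunits, even-numbered blocks pass through the first subunit and odd-numbered blocks through the second, while the loading order of the blocks is preserved.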
  14. The data processing apparatus according to claim 1, wherein each of the plurality of first loading units is further configured to:
    after loading of the input feature data corresponding to that first loading unit is completed, send a second loading instruction to the next first loading unit, so as to trigger the next first loading unit to load, into the systolic processing array, the input feature data corresponding to the next first loading unit.
  15. The data processing apparatus according to claim 8, wherein the sending subunit comprises:
    a first parsing subunit, configured to receive a loading instruction and parse the loading instruction to generate description information of the input feature data to be loaded; and
    a second parsing subunit, configured to parse the description information and send the read instruction to the storage unit according to a parsing result.
  16. The data processing apparatus according to claim 15, wherein the description information comprises at least any one of the following:
    the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  17. The data processing apparatus according to claim 15, wherein the second parsing subunit is further configured to:
    generate auxiliary loading information, the auxiliary loading information being used to determine a loading manner of the input feature data to be loaded;
    the cache subunit comprises:
    a first cache subunit, configured to cache the auxiliary loading information; and
    a second cache subunit, configured to read the auxiliary loading information from the first cache subunit, and cache, according to the auxiliary loading information, the input feature data returned according to the read instruction.
  18. The data processing apparatus according to claim 17, wherein the second cache subunit comprises:
    a third cache subunit, configured to cache the input feature data returned by the storage unit according to the read instruction;
    a read-write subunit, configured to read the auxiliary loading information from the first cache subunit, rearrange, according to the auxiliary loading information, the input feature data cached in the third cache subunit, and write the rearranged input feature data into a fourth cache subunit; and
    the fourth cache subunit, configured to cache the rearranged input feature data for the loading subunit to load the rearranged input feature data into the systolic processing array.
  19. The data processing apparatus according to claim 18, wherein there are a plurality of fourth cache subunits, and the plurality of fourth cache subunits are capable of sequentially caching the rearranged input feature data;
    the loading subunit is configured to sequentially load the input feature data in each of the plurality of fourth cache subunits into the systolic processing array.
  20. The data processing apparatus according to claim 19, wherein each of the plurality of fourth cache subunits is capable of starting to cache the rearranged input feature data once caching by the previous fourth cache subunit is completed.
  21. The data processing apparatus according to claim 19, wherein each of the plurality of fourth cache subunits comprises a plurality of fifth cache subunits, a height of each of the plurality of fifth cache subunits included in the fourth cache subunit being equal to a height of the fourth cache subunit, and a width of each fifth cache subunit being equal to a bit width of the input feature data read from the storage unit.
  22. The data processing apparatus according to claim 18, wherein the read-write subunit is further configured to:
    perform second padding on the input feature data cached in the third cache subunit in the process of rearranging the input feature data cached in the third cache subunit.
  23. The data processing apparatus according to claim 22, wherein the second padding comprises padding at least one of an upper boundary and a lower boundary of the input feature data cached in the third cache subunit.
  24. The data processing apparatus according to claim 17, wherein the auxiliary loading information comprises at least any one of the following:
    a sliding stride of the convolution kernel parameters in the column direction of the input feature data, a height of a dilated convolution kernel, and a padding parameter.
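The two padding steps referenced in claims 4-5 and 22-23 (first padding on the left/right boundaries during loading, second padding on the upper/lower boundaries during rearrangement) can be sketched as follows. Zero-valued padding and the helper names are assumptions for illustration; the claims do not fix the padding value:

```python
# Minimal sketch of the two padding steps: the first fill pads the
# left/right boundaries of the input feature map, the second pads the
# top/bottom boundaries. Zero padding is assumed here.

def pad_left_right(feature_map, left, right):
    """First padding: extend each row at its left and right boundaries."""
    return [[0] * left + row + [0] * right for row in feature_map]

def pad_top_bottom(feature_map, top, bottom):
    """Second padding: add all-zero rows above and below the map."""
    width = len(feature_map[0])
    return ([[0] * width for _ in range(top)] + feature_map
            + [[0] * width for _ in range(bottom)])

fm = [[1, 2], [3, 4]]
padded = pad_top_bottom(pad_left_right(fm, 1, 1), 1, 1)
# `padded` is a 4x4 map with the original 2x2 block in the centre.
```

Applying both fills with a padding width of 1 surrounds the 2x2 example map with a one-element zero border, as a same-size 3x3 convolution would require.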
  25. The data processing apparatus according to claim 1, wherein the systolic processing array comprises a plurality of array blocks, each array block being configured to process one group of input feature data, and each of the plurality of first loading units being configured to load the input feature data into at least one array block.
  26. The data processing apparatus according to claim 25, wherein each of the plurality of first loading units comprises a cache subunit configured to cache the input feature data read from the storage unit, a depth of the cache subunit being equal to a depth of the array block loaded by the first loading unit.
  27. The data processing apparatus according to claim 25, wherein the array blocks are all equal in height.
  28. The data processing apparatus according to claim 27, wherein a size of one array block is equal to a size of one set of convolution kernel parameters loaded in the systolic processing array.
  29. The data processing apparatus according to any one of claims 1 to 28, wherein a size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  30. The data processing apparatus according to claim 29, wherein the size of the systolic processing array is an integer multiple of the size of one set of convolution kernel parameters loaded in the systolic processing array.
  31. The data processing apparatus according to any one of claims 1 to 28, wherein the number of columns of the systolic processing array is an integer multiple of 3.
  32. The data processing apparatus according to any one of claims 1 to 28, wherein the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, u being a positive integer.
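The one-cycle skew of claim 32 matches the usual wavefront timing of a systolic array: each successive row receives its data one clock later than the row above it. The following sketch only computes that schedule; the function name and the dictionary representation are illustrative assumptions:

```python
# Sketch of the skewed load schedule in claim 32: data for row u of the
# systolic array is presented one clock cycle before row u+1, so inputs
# arrive aligned with the wavefront propagating through the array.

def load_schedule(num_rows, start_cycle=0):
    """Return {row index: clock cycle at which that row is loaded}."""
    return {u: start_cycle + u for u in range(num_rows)}

sched = load_schedule(4)
# Row 0 is loaded at cycle 0, row 1 at cycle 1, and so on, one cycle apart.
```

Consecutive rows in the returned schedule always differ by exactly one cycle, which is the property the claim states.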
  33. A data processing system, comprising the apparatus according to any one of claims 1 to 32; and
    a systolic processing array, configured to load the input feature data and convolution kernel parameters, and to perform convolution processing on the input feature data and the convolution kernel parameters.
  34. The data processing system according to claim 33, wherein the system further comprises:
    a second loading unit, configured to load the convolution kernel parameters into the systolic processing array.
  35. The data processing system according to claim 33, wherein the system further comprises:
    a storage unit, configured to store the input feature data.
  36. The data processing system according to claim 35, wherein the storage unit comprises:
    a plurality of mutually independent storage subunits, each storage subunit being configured to store part of the input feature data;
    the plurality of first loading units being configured to access different storage subunits at the same time, to acquire the input feature data stored in the corresponding storage subunits.
  37. The data processing system according to claim 36, wherein the storage unit further comprises:
    a scheduling unit, configured to receive access requests from the plurality of first loading units and send the access requests to the corresponding storage subunits, so that the plurality of first loading units access the corresponding storage subunits.
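The banked-memory arrangement of claims 36-37 can be sketched as independent "banks" plus a scheduler that routes each loader's request to its own bank, so that reads proceed in parallel without contention. The round-robin row layout, the request tuples, and all names below are assumptions made purely for illustration:

```python
# Illustrative sketch of claims 36-37: the input feature data is split
# across mutually independent storage subunits (banks), and a scheduler
# forwards each loader's access request to the corresponding bank.

def build_banks(feature_rows, num_banks):
    """Distribute rows of the feature map round-robin over the banks."""
    banks = [[] for _ in range(num_banks)]
    for i, row in enumerate(feature_rows):
        banks[i % num_banks].append(row)
    return banks

def schedule_access(banks, requests):
    """requests: list of (loader_id, bank_id, index). In one cycle each
    loader must target a distinct bank, so all reads can be served at once."""
    assert len({bank_id for _, bank_id, _ in requests}) == len(requests)
    return {loader: banks[bank][idx] for loader, bank, idx in requests}

banks = build_banks([[1, 1], [2, 2], [3, 3], [4, 4]], num_banks=2)
served = schedule_access(banks, [(0, 0, 0), (1, 1, 0)])
```

The assertion inside `schedule_access` encodes the conflict-freedom assumption: two loaders hitting the same bank in the same cycle would have to be serialized, which this sketch does not model.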
  38. The data processing system according to claim 33, wherein the system further comprises:
    an output unit, configured to acquire a processing result output by the systolic processing array, and to store the processing result or output the processing result.
  39. The system according to any one of claims 33 to 38, wherein the system is implemented based on an FPGA or an ASIC.
  40. A neural network accelerator, wherein the neural network accelerator comprises the apparatus according to any one of claims 1 to 32, or comprises the system according to any one of claims 33 to 39.
  41. The neural network accelerator according to claim 40, wherein the neural network accelerator is a CNN accelerator or an RNN accelerator.
  42. A data processing method, applied to each first loading unit in a data processing apparatus comprising a plurality of first loading units, to load input feature data into a systolic processing array, wherein the method comprises:
    accessing a storage unit in parallel through each of the plurality of first loading units, to read the input feature data from the storage unit;
    caching the read input feature data; and
    loading the cached input feature data into at least one row of processing units in the systolic processing array.
  43. The data processing method according to claim 42, wherein a sum of the rates at which the plurality of first loading units cache valid data in the input feature data is greater than or equal to a loading rate of the systolic processing array.
  44. The data processing method according to claim 43, wherein, in each clock cycle, the sum of the caching rates of the input feature data is greater than or equal to the loading rate of the systolic processing array in that clock cycle; or
    a sum of average caching rates of the input feature data is greater than or equal to an average loading rate of the systolic processing array; or
    the sum of the average caching rates of the input feature data is greater than or equal to a maximum loading rate of the systolic processing array.
  45. The data processing method according to claim 42, wherein the method further comprises:
    performing first padding on the input feature data in the process of loading the input feature data.
  46. The data processing method according to claim 45, wherein the performing first padding on the input feature data comprises:
    padding at least one of a left boundary and a right boundary of the input feature data.
  47. The data processing method according to claim 42, wherein the accessing a storage unit in parallel through each of the plurality of first loading units, to read input feature data from the storage unit, comprises:
    reading one data block of the input feature data from the storage unit at a time; and
    the loading the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    when caching of the data block is completed, loading the data block into at least one row of processing units in the systolic processing array.
  48. The data processing method according to claim 47, wherein the number of columns or rows of the data block is equal to the number of data elements corresponding to a minimum access unit of the storage unit.
  49. The data processing method according to claim 42, wherein each of the plurality of first loading units comprises a sending subunit, a cache subunit, and a loading subunit; and the accessing a storage unit in parallel through each of the plurality of first loading units, to read input feature data from the storage unit, comprises:
    sending, through the sending subunit, a read instruction to the storage unit;
    caching, through the cache subunit, the input feature data returned according to the read instruction; and loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array.
  50. The data processing method according to claim 49, wherein a height of the cache subunit is equal to a height of the systolic processing array.
  51. The data processing method according to claim 50, wherein, in a case in which the systolic processing array comprises a plurality of array blocks, each cache subunit corresponds to one array block, and the height of the cache subunit is equal to a height of the corresponding array block.
  52. The data processing method according to claim 49, wherein there are a plurality of cache subunits;
    the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction; and
    the loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    sequentially loading the input feature data in each of the plurality of cache subunits into the at least one row of processing units of the systolic processing array.
  53. The data processing method according to claim 52, wherein the sequentially caching, through the plurality of cache subunits, the input feature data returned according to the read instruction comprises:
    when caching of the input feature data corresponding to the previous one of the plurality of cache subunits is completed, starting to cache the input feature data corresponding to the current one of the plurality of cache subunits.
  54. The data processing method according to claim 49, wherein the cache subunit comprises a plurality of cache blocks;
    the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    caching, through each of the plurality of cache blocks, the input feature data required by one row of processing units in the systolic processing array; and
    the loading, through the loading subunit, the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    loading, through the loading subunit, the input feature data cached in each of the plurality of cache blocks into the corresponding row of processing units in the systolic processing array; wherein, after loading of the input feature data cached in the v-th cache block is completed, a first loading instruction is sent to the (v+1)-th cache block, so that the loading subunit starts to load the input feature data cached in the (v+1)-th cache block, v being an integer greater than 1.
  55. The data processing method according to claim 42, wherein the accessing a storage unit in parallel through each of the plurality of first loading units, to read the input feature data from the storage unit, comprises:
    after loading of the input feature data corresponding to each of the plurality of first loading units is completed, sending a second loading instruction to the next first loading unit, so as to trigger the next first loading unit to load the input feature data into the systolic processing array.
  56. The data processing method according to claim 49, wherein the sending subunit comprises a first parsing subunit and a second parsing subunit; and
    the sending, through the sending subunit, a read instruction to the storage unit comprises:
    receiving, through the first parsing subunit, a loading instruction, and parsing the loading instruction to generate description information of the input feature data to be loaded; and
    parsing, through the second parsing subunit, the description information, and sending the read instruction to the storage unit according to a parsing result.
  57. The data processing method according to claim 56, wherein the description information comprises at least any one of the following:
    the number of groups of input feature data to be loaded, the number of the first group of input feature data to be loaded, the number of groups of input feature data that the corresponding array block can process simultaneously, the number of groups of input feature data processed simultaneously by the systolic processing array, the height of the convolution kernel, the starting position of the corresponding array block, the base address of the input feature data to be loaded, the size of the storage space occupied by each group of input feature data to be loaded, and the width of the input feature data to be loaded.
  58. The data processing method according to claim 56, wherein the cache subunit comprises a first cache subunit and a second cache subunit; and the caching, through the cache subunit, the input feature data returned according to the read instruction comprises:
    generating auxiliary loading information, and determining, through the auxiliary loading information, a loading manner of the input feature data to be loaded;
    caching, through the first cache subunit, the auxiliary loading information; and
    reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching, according to the auxiliary loading information, the input feature data returned according to the read instruction.
  59. The data processing method according to claim 58, wherein the reading, through the second cache subunit, the auxiliary loading information from the first cache subunit, and caching the input feature data according to the auxiliary loading information, further comprises:
    caching, through a third cache subunit, the input feature data returned by the storage unit according to the read instruction;
    reading, through a read-write subunit, the auxiliary loading information from the first cache subunit, rearranging, according to the auxiliary loading information, the input feature data cached in the third cache subunit, and writing the rearranged input feature data into a fourth cache subunit; and
    caching, through the fourth cache subunit, the rearranged input feature data for the loading subunit to load the rearranged input feature data into the systolic processing array.
  60. The data processing method according to claim 59, wherein there are a plurality of the fourth cache subunits;
    caching, by the fourth cache subunit, the rearranged input feature data comprises:
    caching the rearranged input feature data sequentially by the plurality of fourth cache subunits; and
    loading, by the loading subunit, the rearranged input feature data into the systolic processing array comprises:
    loading, by the loading subunit, the input feature data in each of the plurality of fourth cache subunits into the systolic processing array in sequence.
  61. The data processing method according to claim 60, wherein caching of the rearranged input feature data corresponding to a current fourth cache subunit of the plurality of fourth cache subunits begins after caching of the rearranged input feature data corresponding to the preceding fourth cache subunit is complete.
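Claims 60 and 61 describe the classic ping-pong (multi-buffer) pattern: the fourth cache subunits are filled one after another, and while one buffer is being drained into the systolic array the next can already be filling. A minimal schedule sketch, assuming two buffers and per-tile fill/drain events (the event representation is illustrative, not from the specification):

```python
def pingpong_schedule(num_tiles: int, num_buffers: int = 2):
    """Ping-pong schedule over the fourth cache subunits: tile t is written
    into buffer t % num_buffers; fills are sequential (claim 61), while the
    drain of tile t into the array overlaps the fill of tile t+1."""
    events = []
    for t in range(num_tiles):
        buf = t % num_buffers
        events.append(("fill", t, buf))  # cache rearranged data into buffer
        if t > 0:
            # overlapped: the previously filled buffer drains into the array
            events.append(("drain", t - 1, (t - 1) % num_buffers))
    events.append(("drain", num_tiles - 1, (num_tiles - 1) % num_buffers))
    return events
```

The overlap hides the rearrangement-and-cache latency behind the array's compute time, which is the usual motivation for providing more than one such buffer.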
  62. The data processing method according to claim 60, wherein each of the plurality of fourth cache subunits comprises a plurality of fifth cache subunits, the height of each fifth cache subunit is equal to the height of the fourth cache subunit, and the width of each fifth cache subunit is equal to the bit width of the input feature data read from the storage unit.
  63. The data processing method according to claim 59, wherein the method further comprises:
    performing second padding on the input feature data cached by the third cache subunit in the process of rearranging the input feature data cached by the third cache subunit.
  64. The data processing method according to claim 63, wherein performing the second padding on the input feature data cached by the third cache subunit comprises:
    padding at least one of an upper boundary and a lower boundary of the input feature data cached by the third cache subunit.
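The vertical padding of claim 64 can be sketched on a plain list-of-rows feature map; the padding amounts and fill value here are illustrative parameters, not values recited by the claims:

```python
def pad_vertical(rows, pad_top=0, pad_bottom=0, value=0):
    """Pad the upper and/or lower boundary of a 2-D feature map with rows of
    a constant value (claim 64: second padding on at least one boundary)."""
    width = len(rows[0]) if rows else 0
    zero_row = [value] * width
    return ([list(zero_row) for _ in range(pad_top)]
            + [list(r) for r in rows]
            + [list(zero_row) for _ in range(pad_bottom)])
```

Applying the padding during the rearrangement step, as claim 63 describes, avoids a separate pass over the data: the padded rows are simply emitted alongside the rearranged ones.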
  65. The data processing method according to claim 58, wherein the auxiliary loading information comprises at least any one of the following:
    the sliding stride of the convolution kernel parameters in the column direction of the input feature data, the height of the dilated convolution kernel, and the padding parameter.
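The "height of the dilated convolution kernel" in claim 65 follows from the nominal kernel height and the dilation rate by the standard dilated-convolution relation (the formula is conventional, not recited in the claim itself):

```python
def dilated_kernel_height(kernel_h: int, dilation: int) -> int:
    """Effective height of a dilated (atrous) convolution kernel:
    d*(k-1)+1 rows are spanned when (d-1) gaps separate adjacent taps."""
    return dilation * (kernel_h - 1) + 1
```

With dilation 1 this reduces to the plain kernel height, so carrying the dilated height in the auxiliary loading information covers ordinary convolution as a special case.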
  66. The data processing method according to claim 42, wherein the systolic processing array comprises a plurality of array blocks, and each array block is used to process one group of input feature data;
    loading the cached input feature data into at least one row of processing units in the systolic processing array comprises:
    loading, by each of the plurality of first loading units, the input feature data into at least one array block.
  67. The data processing method according to claim 66, wherein each of the plurality of first loading units comprises a cache subunit, and the depth of the cache subunit is equal to the depth of the array block loaded by the first loading unit;
    caching the read input feature data comprises:
    caching, by the cache subunit, the input feature data read from the storage unit.
  68. The data processing method according to claim 66, wherein the array blocks are all equal in height.
  69. The data processing method according to claim 68, wherein the size of one array block is equal to the size of one convolution kernel parameter loaded in the systolic processing array.
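Claims 68 and 69 partition the systolic array into equal-height blocks, each sized to one convolution kernel parameter, so the block boundaries fall out directly from the kernel height. A sketch under the assumption (consistent with claim 71) that the array height is an exact multiple of the kernel height:

```python
def partition_array(array_rows: int, kernel_h: int):
    """Split a systolic array of array_rows rows into equal array blocks of
    kernel_h rows each (claims 68-69); returns each block's starting row.
    Assumes array_rows is an integer multiple of kernel_h (claim 71)."""
    if array_rows % kernel_h != 0:
        raise ValueError("array height must be a multiple of the kernel height")
    return [b * kernel_h for b in range(array_rows // kernel_h)]
```

Each first loading unit can then be associated with one or more of these block start rows, matching claim 66's one-block-per-group mapping.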
  70. The data processing method according to any one of claims 42 to 69, wherein the size of the systolic processing array is determined according to the size of the convolution kernel parameters loaded in the systolic processing array.
  71. The data processing method according to claim 70, wherein the size of the systolic processing array is an integer multiple of the size of one convolution kernel parameter loaded in the systolic processing array.
  72. The data processing method according to any one of claims 42 to 69, wherein the number of columns of the systolic processing array is an integer multiple of 3.
  73. The data processing method according to any one of claims 42 to 69, wherein the time at which the plurality of first loading units load data into the u-th row of processing units of the systolic processing array is one clock cycle earlier than the time at which the plurality of first loading units load data into the (u+1)-th row of processing units of the systolic processing array, where u is a positive integer.
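The one-cycle stagger of claim 73 is the classic input skew of a systolic array: row u receives its first datum one clock cycle before row u+1, so the data wavefront advances diagonally through the array. A minimal schedule sketch (the cycle numbering origin is illustrative):

```python
def load_schedule(num_rows: int, start_cycle: int = 0):
    """Cycle at which each row of the systolic array receives its first
    datum: row u is loaded one clock cycle before row u+1 (claim 73)."""
    return {u: start_cycle + (u - 1) for u in range(1, num_rows + 1)}
```

The skew ensures that partial sums travelling down the array meet the input operands of each row at the right cycle, without any row-level handshaking.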
  74. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 42 to 73.
  75. A data processing apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps performed by any first loading unit in the method according to any one of claims 42 to 73.
PCT/CN2020/106556 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator WO2022027172A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/106556 WO2022027172A1 (en) 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator

Publications (1)

Publication Number Publication Date
WO2022027172A1 true WO2022027172A1 (en) 2022-02-10

Family

ID=80119809

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106556 WO2022027172A1 (en) 2020-08-03 2020-08-03 Data processing apparatus, method, and system, and neural network accelerator

Country Status (1)

Country Link
WO (1) WO2022027172A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115600652A (en) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) Convolutional neural network processing device, high-speed target detection method and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103316A1 (en) * 2015-05-21 2017-04-13 Google Inc. Computing convolutions using a neural network processor
CN109416754A (en) * 2016-05-26 2019-03-01 多伦多大学管理委员会 Accelerator for deep neural network
CN110333827A (en) * 2019-07-11 2019-10-15 山东浪潮人工智能研究院有限公司 A kind of data loading device and data load method
CN110705687A (en) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 Convolution neural network hardware computing device and method
US10713214B1 (en) * 2017-09-27 2020-07-14 Habana Labs Ltd. Hardware accelerator for outer-product matrix multiplication

Similar Documents

Publication Publication Date Title
US11593594B2 (en) Data processing method and apparatus for convolutional neural network
WO2018196863A1 (en) Convolution acceleration and calculation processing methods and apparatuses, electronic device and storage medium
JP2019036298A (en) Intelligent high bandwidth memory system and logic dies therefor
CN104615488A (en) Task scheduling method and device on heterogeneous multi-core reconfigurable computing platform
TW200402653A (en) Shared memory controller for display processor
TW202207031A (en) Load balancing for memory channel controllers
WO2022027172A1 (en) Data processing apparatus, method, and system, and neural network accelerator
US11467973B1 (en) Fine-grained access memory controller
US8386687B2 (en) Method and apparatus for data transfer
KR20210151250A (en) extended memory interface
US9575759B2 (en) Memory system and electronic device including memory system
CN111694513A (en) Memory device and method including a circular instruction memory queue
CN112433847B (en) OpenCL kernel submitting method and device
CN110633226A (en) Fusion memory, storage system and deep learning calculation method
CN101002272A (en) Addressing data within dynamic random access memory
JP2024516514A (en) Memory mapping of activations for implementing convolutional neural networks
TW202213127A (en) Graphics processor and acceleration method thereof
CN116360672A (en) Method and device for accessing memory and electronic equipment
CN112035056A (en) Parallel RAM access architecture and access method based on multiple computing units
US11734551B2 (en) Data storage method for speech-related DNN operations
CN112639747A (en) Addressing method of processor, movable platform and electronic equipment
TWI819428B (en) Processor apparatus
US11983128B1 (en) Multidimensional and multiblock tensorized direct memory access descriptors
US20030208671A1 (en) Data flow processor
CN111832714A (en) Operation method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20948463

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20948463

Country of ref document: EP

Kind code of ref document: A1