WO2022021459A1 - Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium - Google Patents

Info

Publication number
WO2022021459A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data set
computing unit
target
address
Application number
PCT/CN2020/106761
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
王卓
Original Assignee
中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022021459A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, for peripheral storage systems, e.g. disk cache
    • G06F 12/0871 Allocation or management of cache space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G06F 9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention belongs to the technical field of integrated circuits and, in particular, relates to a data preloading method for a convolutional neural network and a computer-readable storage medium.
  • The convolution operation is a very important deep learning feature extraction method.
  • Current mainstream deep learning neural networks such as LeNet1, AlexNet, VGG-16, and VGG-19 are built by stacking convolutional layers.
  • As the number of network layers increases, the classification accuracy improves.
  • Because the convolution operation itself consumes a great deal of computing power, general-purpose computer platforms cannot keep up in computing power and speed, so a dedicated convolution processing chip must be designed.
  • For a dedicated convolution processing chip, optimizations such as adding computing nodes, enlarging the data cache, and improving data-type conversion can greatly increase the calculation speed of the computing units.
  • However, current dedicated convolution processing chips still read data in the traditional way: when the processor needs data, it first searches the corresponding cache, and if the data is not there, it locates and reads it from external memory. This process takes a long time and delays the overall computation. In particular, when multiple processor units need to read from external memory, they can only access it sequentially, which prolongs data loading.
  • The technical problem solved by the present invention is how to increase the data loading speed so as to speed up the overall operation of the convolutional neural network.
  • A data preloading method for a convolutional neural network, comprising:
  • acquiring an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network;
  • before each computing unit performs the convolution calculation, storing the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • One specific method of storing the data in the input data set into the caches corresponding to the computing units according to the predetermined distribution scheme includes:
  • splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
  • obtaining the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;
  • storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • Each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.
  • The target address of each computing unit includes a plurality of address segments; the addresses within the same address segment are consecutive, while the addresses of different address segments are spaced apart.
  • Another specific method of storing the data in the input data set into the caches corresponding to the computing units according to the predetermined distribution scheme includes:
  • assigning a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;
  • obtaining the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;
  • storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • Each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.
  • The target sequence of each computing unit includes a plurality of sequence segments; the time numbers within the same sequence segment are consecutive, while the time numbers of different sequence segments are spaced apart.
  • The present application also discloses a data preloading apparatus for a convolutional neural network, the apparatus comprising:
  • a data acquisition module for acquiring an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network;
  • a data distribution module for storing, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The data loading apparatus may further include a configuration decoder for receiving a configuration file and generating from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.
  • In this case the data distribution module includes:
  • a data splicing unit for splicing each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;
  • a data storage unit for storing the input data set;
  • an address reading unit for reading the target address of each computing unit;
  • a data allocation unit for storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • Alternatively, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.
  • In this case the data distribution module includes:
  • a time coding unit for assigning a time number to each datum of the original data set and each datum of the zero-padded data set in the preset time order;
  • a sequence acquisition unit for acquiring the target sequence of each computing unit;
  • a data allocation unit for storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The configuration decoder is further configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated.
  • The data acquisition module includes:
  • a memory controller for reading data from memory according to the memory address information to form the original data set;
  • a zero-padding generator for generating the zero-padded data set according to the space information.
  • The present invention also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the above data preloading method for a convolutional neural network is implemented.
  • The invention thus discloses a data preloading method for a convolutional neural network with the following technical effects compared with the traditional calculation method:
  • by sorting the original data and zero-padding data in time or space to form an input data set and allocating each datum in the input data set to the caches of different computing units, data reuse is improved, the number of memory reads is reduced, data preparation time is shortened, the latency between layer calculations is lowered, and the overall power consumption of the chip is reduced.
  • FIG. 1 is a flowchart of the data preloading method for a convolutional neural network according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram of zero-padding the original data according to Embodiment 1 of the present invention.
  • FIG. 3 is a flowchart of data allocation according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram of the input data set according to Embodiment 1 of the present invention.
  • FIG. 5 is a flowchart of data allocation according to Embodiment 2 of the present invention.
  • FIG. 6 is a schematic diagram of the input data set according to Embodiment 2 of the present invention.
  • FIG. 7 is an architecture diagram of the data preloading apparatus according to Embodiment 3 of the present invention.
  • FIG. 8 is a diagram of the connections among the data preloading apparatus, the memory, and the caches according to Embodiment 3 of the present invention.
  • FIG. 9 is an architecture diagram of the data preloading apparatus according to Embodiment 4 of the present invention.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
  • In the prior art, during calculation the processor reads data from external memory and temporarily stores it in the cache; whenever new data is needed, it must again be read from external memory. Because the processor computes faster than data can be read, it "waits" for data during the calculation, which lengthens the overall operation time.
  • In the present application, the data each computing unit is about to use are stored in the corresponding cache before the computing unit starts calculating; once calculation begins, the unit only needs to read the data from its cache, which greatly improves read speed and reduces the overall computation time.
  • The data preloading method for a convolutional neural network in Embodiment 1 includes the following steps:
  • Step S10: acquire an original data set and a zero-padded data set, the two together constituting the input data set of the convolutional neural network.
  • Step S20: before each computing unit performs the convolution calculation, store the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • To facilitate the calculation of subsequent convolutional layers, the original image data need to be zero-padded to form an image of a specific size.
  • After processing by the first convolutional layer, an output image of a specific size is generated.
  • In step S10, the original data set is read from memory according to configuration parameters and the corresponding zero-padded data set is generated; the configuration parameters include the address length of memory access, the length of the zero-padding data, and the like.
  • In FIG. 2, the right image is the output image, the middle image is the convolution kernel, and the left image is the input image.
  • The output image size is 28*28, there are 64 computing units, and the convolution kernel size is 5*5.
  • The dotted box in the left figure represents the original 28*28 input image, i.e., the original data set.
  • Two rows of zero data are added around the dotted box to form a 32*32 input image, so that convolving the kernel with the 204 data from (0, 0) to (6, 11) of the input image yields the 64 output data from (0, 0) to (2, 7) of the output image.
  • The above is an exemplary description; the position and length of the zero-padding data are determined by actual needs and are not elaborated here.
  • Step S20 includes the following steps:
  • Step S21: assign a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;
  • Step S22: obtain the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;
  • Step S23: store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The original data set and the zero-padded data set are "spliced" in the time dimension to form an input data stream.
  • By matching the time number of each datum in the stream against the target sequences, each computing unit's cache can read in all the required data in advance.
  • As shown in FIG. 4, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set. Time numbers are assigned in order from (0, 0) to (4, 4), each digit denoting a time number.
  • Assuming a 3*3 convolution kernel, for three of the computing units the target sequence of the first is (1, 2, 3, 6, 7, 8, 11, 12, 13), that of the second is (2, 3, 4, 7, 8, 9, 12, 13, 14), and that of the third is (3, 4, 5, 8, 9, 10, 13, 14, 15); in this way the data in the input with matching time numbers are allocated to each computing unit's cache.
  • The data required by each computing unit are discontinuous. If data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching time numbers, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
  • The numbered input data are matched against each computing unit's data and stored as a data stream in chronological order; for example, the first row of the data stream (1, 2, 3, 4, 5) is compared with the three computing units in turn, storing (1, 2, 3) in the first unit's cache, (2, 3, 4) in the second, and (3, 4, 5) in the third, and so on until all data are allocated.
  • In other embodiments the data may be allocated in a different order, which is not limited here.
  • The numbering and allocation of the data can proceed simultaneously: according to the configuration parameters set in advance, as each original datum is read and each zero-padding datum is generated, the data are written directly into the caches of the computing units in the predetermined time order, so no additional storage device is needed for intermediate data.
  • Each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.
  • The target sequence of each computing unit includes a plurality of sequence segments; the time numbers within the same sequence segment are consecutive, while the time numbers of different sequence segments are spaced apart.
  • For the three adjacent computing units above, the datum with time number 1 is required by all three; with the allocation scheme of the present application it is stored into each unit's cache in a single pass, avoiding the traditional need for each computing unit to read it from memory separately and improving data reuse.
  • For the first computing unit, the target sequence (1, 2, 3, 6, 7, 8, 11, 12, 13) comprises three sequence segments, (1, 2, 3), (6, 7, 8), and (11, 12, 13); the segments are far apart in the time series, and the addresses of the corresponding memory locations are likewise far apart. With the traditional method, many addresses must be searched starting from the first address before the required data can be read, which costs considerable loading time.
  • The present application reduces the data loading time during calculation by preloading each datum into each computing unit's cache in advance.
  • The data preloading method of this embodiment sorts the original data and zero-padding data in time to form a data stream and allocates each datum in the stream to the caches of different computing units, thereby improving data reuse, reducing the number of memory reads, shortening data preparation time, lowering the latency between layer calculations, and reducing the overall power consumption of the chip. Moreover, allocation can proceed while sorting, so no additional memory is needed to store intermediate data, which reduces cost.
  • The data preloading method for a convolutional neural network in Embodiment 2 differs from Embodiment 1 in step S20; in Embodiment 2, step S20 includes the following steps:
  • Step S21': splice each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
  • Step S22': obtain the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;
  • Step S23': store the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • The original data set and the zero-padded data set are spliced in the spatial dimension to form the input data set.
  • By matching the spatial address of each datum against the target addresses, each computing unit's cache can read in all the required data in advance. As shown in FIG. 6, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set.
  • Each letter represents the spatial position of an original or zero-padding datum in the input data set. Assuming a 3*3 convolution kernel, for three of the computing units the target address of the first is (A, B, C, F, G, H, K, L, M), that of the second is (B, C, D, G, H, I, L, M, N), and that of the third is (C, D, E, H, I, J, M, N, O); the data at the corresponding spatial addresses in the input data set are thus allocated to each computing unit's cache.
  • The addresses of the data required by each computing unit are discontinuous. If data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching spatial addresses, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
  • The spliced input data set is stored in memory and then matched, in the form of a data stream in spatial order, against the data of each computing unit; for example, the first row of data (A, B, C, D, E) is compared with the three computing units in turn.
  • From the target addresses of the three computing units, the data (A, B, C) are stored in the first unit's cache, (B, C, D) in the second unit's cache, and (C, D, E) in the third unit's cache, and so on until all data are allocated.
  • In other embodiments the data may be allocated in a different order, which is not limited here.
  • Each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.
  • The target address of each computing unit includes a plurality of address segments; the addresses within the same address segment are consecutive, while the addresses of different address segments are spaced apart.
  • The datum at spatial address A is required by all three computing units above; with the allocation scheme of the present application it is stored into each unit's cache in a single pass, avoiding the traditional need for each computing unit to read it from memory separately and improving data reuse.
  • For the first computing unit, the target address (A, B, C, F, G, H, K, L, M) comprises three address segments, (A, B, C), (F, G, H), and (K, L, M); the segments are far apart in the spatial dimension, and the addresses of the corresponding memory locations are likewise far apart. With the traditional method, many addresses must be searched starting from the first address before the required data can be read, which costs considerable loading time.
  • The present application reduces the data loading time during calculation by preloading each datum into each computing unit's cache in advance.
  • The data preloading method of this embodiment splices and stores the original data and zero-padding data in spatial order to form an input data set and allocates each datum in it to the caches of different computing units, thereby improving data reuse; it resolves the low data-loading efficiency caused by the discontinuous data addresses the computing units require, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer calculations, and reduces the overall power consumption of the chip.
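  • As an illustration of this spatial-address matching, the following minimal Python sketch reproduces the 5*5 input and 3*3 kernel example above; the function and variable names are assumptions for illustration, not taken from the patent.

```python
# Minimal sketch of the Embodiment 2 spatial-address allocation.
import string

H = W = 5   # padded input size (5*5, as in the example above)
K = 3       # convolution kernel size (3*3, as in the example above)

# Spatial addresses A, B, C, ... assigned row by row, as in FIG. 6.
addresses = [[string.ascii_uppercase[r * W + c] for c in range(W)]
             for r in range(H)]

def target_addresses(out_r, out_c):
    """Addresses of the K*K window one computing unit needs for one output."""
    return [addresses[out_r + i][out_c + j]
            for i in range(K) for j in range(K)]

units = [target_addresses(0, c) for c in range(3)]
# units[0] == ['A', 'B', 'C', 'F', 'G', 'H', 'K', 'L', 'M'], and so on.

# Preloading: stream the stored input data set once; each unit's cache keeps
# exactly the data whose spatial address matches one of its target addresses.
stream = [a for row in addresses for a in row]
caches = [[d for d in stream if d in set(u)] for u in units]
assert caches[0] == ['A', 'B', 'C', 'F', 'G', 'H', 'K', 'L', 'M']
```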
  • The data preloading apparatus for a convolutional neural network includes a data acquisition module 100' and a data distribution module 200'. The data acquisition module 100' is used to acquire the original data set and the zero-padded data set, which together constitute the input data set of the convolutional neural network.
  • The data distribution module 200' is configured to store, before each computing unit performs the convolution calculation, the data in the input data set into the buffer corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's buffer, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The original data are stored in memory, transmitted to the data preloading apparatus, and allocated to the buffers of the different computing units together with the zero-padding data generated by the apparatus.
  • The data preloading apparatus further includes a configuration decoder 300' configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated; the memory address information includes the address length of memory access, and the space information includes the length of the zero-padding data.
  • The data acquisition module 100' includes a memory controller 101' and a zero-padding generator 102'; the memory controller 101' reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102' generates the zero-padded data set according to the space information.
  • The configuration decoder 300' is used to receive a configuration file and generate from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.
  • The data distribution module 200' includes a time coding unit 201', a sequence acquisition unit 202', and a data allocation unit 203'.
  • The time coding unit 201' is used to assign time numbers to each datum of the original data set and each datum of the zero-padded data set in the preset time order.
  • The sequence acquisition unit 202' is used to acquire the target sequence of each computing unit.
  • The data allocation unit 203' is configured to store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The data processing of the data distribution module 200' is as described in Embodiment 1 and is not repeated here.
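  • To make the data path of this embodiment easier to follow, the rough Python model below shows the configuration decoder feeding the memory controller and the zero-padding generator; the configuration fields and function layout are assumptions for illustration, not the patent's actual interfaces.

```python
# Hypothetical functional model of the Embodiment 3 acquisition path; the
# configuration fields below are assumptions, not the patent's format.
from dataclasses import dataclass

@dataclass
class Config:
    mem_base: int   # memory address information: start address of the reads
    mem_len: int    # memory address information: address length to access
    width: int      # original image width (a square image is assumed)
    pad: int        # space information: zero-padding width per edge

def decode(cfg: Config):
    """Configuration decoder: derive read addresses and padded geometry."""
    read_addrs = range(cfg.mem_base, cfg.mem_base + cfg.mem_len)
    padded_w = cfg.width + 2 * cfg.pad
    return read_addrs, padded_w

def acquire(memory, cfg: Config):
    """Data acquisition module: memory controller plus zero-padding generator."""
    read_addrs, padded_w = decode(cfg)
    original = [memory[a] for a in read_addrs]      # memory controller reads
    n_zeros = padded_w * padded_w - len(original)   # from the space information
    return original, [0] * n_zeros                  # zero-padding generator

# Example: a 28*28 image padded by 2 per edge gives a 32*32 input data set.
mem = list(range(10_000))
orig, zeros = acquire(mem, Config(mem_base=0, mem_len=28 * 28, width=28, pad=2))
assert len(orig) + len(zeros) == 32 * 32
```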
  • The data preloading apparatus for a convolutional neural network includes a data acquisition module 100 and a data distribution module 200. The data acquisition module 100 is used to acquire the original data set and the zero-padded data set, which together constitute the input data set of the convolutional neural network.
  • The data distribution module 200 is configured to store, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The data preloading apparatus further includes a configuration decoder 300 configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated; the memory address information includes the address length of memory access, and the space information includes the length of the zero-padding data.
  • The data acquisition module 100 includes a memory controller 101 and a zero-padding generator 102; the memory controller 101 reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102 generates the zero-padded data set according to the space information.
  • The configuration decoder 300 is further configured to receive a configuration file and generate from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.
  • The data distribution module 200 includes a data splicing unit 201, an address reading unit 202, a data allocation unit 203, and a data storage unit 204. The data splicing unit 201 splices each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set; the data storage unit 204 stores the input data set; the address reading unit 202 reads the target address of each computing unit; and the data allocation unit 203 stores the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • The data processing of the data distribution module 200 is as described in Embodiment 2 and is not repeated here.
  • To verify the effect of the scheme, a simulation experiment was carried out: the design was implemented in Verilog HDL, and its feasibility and running time were simulated and verified with the ModelSim simulation tool.
  • The experimental procedure is mainly to write a configuration file for a specific neural network, write the image data into memory, and then give a start signal, after which the simulation runs automatically.
  • The experiment uses 64 computing units; the input image size is 28*28 with 64 channels, the output image size is 28*28 with 128 channels, and the convolution kernel size is 5*5.
  • Loading the configuration file starts the convolution calculation.
  • Experiment 1 allocates data with the data preloading algorithm of Embodiment 1: the total time to distribute the data from memory to the caches is 0.00504 milliseconds, and the computing units' calculation time is 0.00403 milliseconds.
  • Experiment 2 adopts the traditional calculation method, reading data from memory while calculating: the total time to read all the data from memory is 0.05814 milliseconds, and the calculation time is likewise 0.00403 milliseconds. The experimental results show that the data-reading time with the preloading algorithm of Embodiment 1 is an order of magnitude lower than that of the prior art, greatly improving operation efficiency.
  • The present application also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the data preloading method for a convolutional neural network of Embodiment 1 or Embodiment 2 is implemented.
  • The present application also discloses a computer device.
  • The terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11.
  • The processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing apparatus at the logical level.
  • One or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
  • The computer-readable storage medium 11 stores a data preloading program for the convolutional neural network; when the program is executed by the processor, the above data preloading method for the convolutional neural network is implemented.
  • Computer-readable storage media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology.
  • Information may be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.

Abstract

Disclosed are a data pre-loading apparatus and a data pre-loading method, and a storage medium. The data pre-loading method comprises: acquiring an original data set and a zero-padded data set, wherein the original data set and the zero-padded data set jointly form an input data set of a convolutional neural network; and before calculation units perform convolution calculation, storing, according to a predetermined allocation mode, data in the input data set in caches corresponding to the calculation units, so as to respectively form different target data sets in the caches of the calculation units, wherein data in the target data sets is data required by the calculation units during a convolution calculation process. Temporal sorting or spatial sorting is performed on original data and zero-padded data, an input data set is formed, and pieces of data in the input data set are allocated to caches of different calculation units, so that the data reusability is improved, the memory reading frequency is reduced, the data preparation time is shortened, the delay between layer calculations is reduced, and the overall power consumption of a chip is reduced.

Description

Data preloading apparatus and preloading method thereof, and computer-readable storage medium

Technical Field

The present invention belongs to the technical field of integrated circuits and, in particular, relates to a data preloading method for a convolutional neural network and a computer-readable storage medium.

Background Art

In recent years, owing to the popularity of big data applications and advances in computer hardware, deep learning has been used for feature extraction, classification, and recursive operations on data, with wide application in computer vision, natural language processing, and intelligent system decision-making. The convolution operation is a very important deep learning feature extraction method; current mainstream deep learning neural networks such as LeNet1, AlexNet, VGG-16, and VGG-19 are built by stacking convolutional layers. As the number of network layers increases, classification accuracy improves. However, because the convolution operation itself consumes a great deal of computing power and general-purpose computer platforms cannot keep up in computing power and speed, a dedicated convolution processing chip must be designed.

For a dedicated convolution processing chip, optimizations such as adding computing nodes, enlarging the data cache, and improving data-type conversion can greatly increase the calculation speed of the computing units. However, current dedicated convolution processing chips still read data in the traditional way: when the processor needs data, it first searches the corresponding cache; if the data is not there, it locates and reads it from external memory. This process takes a long time and delays the overall computation. In particular, when multiple processor units need to read data from external memory, they can only access the memory sequentially, which prolongs data loading.

Therefore, on top of increasing the operation speed of the computing units, how to increase the data loading speed is a technical problem urgently needing a solution in the art.
Summary of the Invention

(1) Technical problem to be solved by the present invention

The technical problem solved by the present invention is how to increase the data loading speed so as to speed up the overall operation of the convolutional neural network.

(2) Technical solution adopted by the present invention
A data preloading method for a convolutional neural network, the method comprising:

acquiring an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting the input data set of the convolutional neural network;

before each computing unit performs the convolution calculation, storing the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Optionally, the specific method of storing the data in the input data set into the cache corresponding to each computing unit according to the predetermined distribution scheme includes:

splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;

obtaining the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;

storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.

Optionally, each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.

Optionally, the target address of each computing unit includes a plurality of address segments, where the addresses within the same address segment are consecutive and the addresses of different address segments are spaced apart.

Optionally, the specific method of storing the data in the input data set into the cache corresponding to each computing unit according to the predetermined distribution scheme includes:

assigning a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;

obtaining the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;

storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Optionally, each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.

Optionally, the target sequence of each computing unit includes a plurality of sequence segments, where the time numbers within the same sequence segment are consecutive and the time numbers of different sequence segments are spaced apart.
The present application also discloses a data preloading apparatus for a convolutional neural network, the apparatus comprising:

a data acquisition module for acquiring an original data set and a zero-padded data set, the two together constituting the input data set of the convolutional neural network;

a data distribution module for storing, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Optionally, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.

The data distribution module includes:

a data splicing unit for splicing each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;

a data storage unit for storing the input data set;

an address reading unit for reading the target address of each computing unit;

a data allocation unit for storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.

Alternatively, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.

The data distribution module includes:

a time coding unit for assigning a time number to each datum of the original data set and each datum of the zero-padded data set in the preset time order;

a sequence acquisition unit for acquiring the target sequence of each computing unit;

a data allocation unit for storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Optionally, the configuration decoder is further configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated.

The data acquisition module includes:

a memory controller for reading data from memory according to the memory address information to form the original data set;

a zero-padding generator for generating the zero-padded data set according to the space information.
The present invention also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the above data preloading method for a convolutional neural network is implemented.

(3) Beneficial effects

The present invention discloses a data preloading method for a convolutional neural network which, compared with the traditional calculation method, has the following technical effects:

(1) By sorting the original data and zero-padding data in time or space to form an input data set and allocating each datum in the input data set to the caches of different computing units, data reuse is improved, the number of memory reads is reduced, data preparation time is shortened, the latency between layer calculations is lowered, and the overall power consumption of the chip is reduced.
Brief Description of the Drawings

FIG. 1 is a flowchart of the data preloading method for a convolutional neural network according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of zero-padding the original data according to Embodiment 1 of the present invention;

FIG. 3 is a flowchart of data allocation according to Embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of the input data set according to Embodiment 1 of the present invention;

FIG. 5 is a flowchart of data allocation according to Embodiment 2 of the present invention;

FIG. 6 is a schematic diagram of the input data set according to Embodiment 2 of the present invention;

FIG. 7 is an architecture diagram of the data preloading apparatus according to Embodiment 3 of the present invention;

FIG. 8 is a diagram of the connections among the data preloading apparatus, the memory, and the caches according to Embodiment 3 of the present invention;

FIG. 9 is an architecture diagram of the data preloading apparatus according to Embodiment 4 of the present invention;

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

Before describing the embodiments of the present application in detail, the inventive concept is briefly described. In the prior art, during calculation the processor reads data from external memory and temporarily stores it in the cache; whenever new data is needed for the next calculation, it must again be read from external memory. Because the processor computes faster than data can be read, the processor "waits" for data during the calculation, which lengthens the overall operation time. The present application stores the data each computing unit is about to use in the corresponding cache before the computing unit starts calculating; once calculation begins, the unit only needs to read the corresponding data from its cache, which greatly improves read speed and reduces the overall computation time.
Embodiment 1

Specifically, as shown in FIG. 1, the data preloading method for a convolutional neural network of Embodiment 1 includes the following steps:

Step S10: acquire an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting the input data set of the convolutional neural network.

Step S20: before each computing unit performs the convolution calculation, store the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Specifically, when images of different sizes are input into a convolutional neural network, output images of different sizes are obtained after the convolution calculation. To facilitate the calculation of subsequent convolutional layers, the original image data need to be zero-padded to form an image of a specific size; after processing by the first convolutional layer, an output image of a specific size is generated. Further, in step S10, the original data set is read from memory according to configuration parameters and the corresponding zero-padded data set is generated; the configuration parameters include the address length of memory access, the length of the zero-padding data, and the like. For example, as shown in FIG. 2, the right image is the output image, the middle image is the convolution kernel, and the left image is the input image. The output image size is 28*28, there are 64 computing units with 64 output data from (0, 0) to (2, 7), and the convolution kernel size is 5*5. The dotted box in the left figure represents the original 28*28 input image, i.e., the original data set; two rows of zero data are added around the dotted box to form a 32*32 input image, so that convolving the kernel with the 204 data from (0, 0) to (6, 11) of the input image yields the 64 output data from (0, 0) to (2, 7) of the output image. Of course, this is an exemplary description; the position and length of the zero-padding data are determined by actual needs and are not elaborated here.
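As a concrete sketch of this padding step, the short numpy snippet below reproduces the 28*28 to 32*32 example; the variable names are illustrative and not part of the patent.

```python
# Sketch of step S10 for the example above: pad the 28*28 original image
# with two rings of zeros to form the 32*32 input of a 5*5 convolution.
import numpy as np

original = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)  # stand-in data
pad = (5 - 1) // 2   # a 5*5 kernel keeping a 28*28 output needs 2 zeros per edge
padded = np.pad(original, pad, mode="constant", constant_values=0)
assert padded.shape == (32, 32)
```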
Further, as shown in FIG. 3, step S20 includes the following steps:

Step S21: assign a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;

Step S22: obtain the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;

Step S23: store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Specifically, using configuration parameters set in advance, including the chronological order of each original datum and each zero-padding datum, the original data set and the zero-padded data set are "spliced" in the time dimension to form an input data stream. The time number of each datum in the input data stream is matched against the time numbers of the data each computing unit requires for the convolution calculation, i.e., the target sequence, and the matching data are stored in the cache corresponding to each computing unit, so that each computing unit's cache holds all the required data in advance.

For example, as shown in FIG. 4, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set. Time numbers are assigned in order from (0, 0) to (4, 4), each digit denoting a time number. Assuming a 3*3 convolution kernel, for three of the computing units the target sequence of the first is (1, 2, 3, 6, 7, 8, 11, 12, 13), that of the second is (2, 3, 4, 7, 8, 9, 12, 13, 14), and that of the third is (3, 4, 5, 8, 9, 10, 13, 14, 15); in this way the data in the input with matching time numbers are allocated to each computing unit's cache. As these target sequences show, the data each computing unit requires are discontinuous; if data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching time numbers, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
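The target sequences in this example follow directly from the output position and the kernel size. A minimal Python sketch, assuming row-major time numbering starting at 1 as in FIG. 4 (the names are illustrative, not the patent's):

```python
# Sketch of the target sequences above: time numbers run row by row over the
# 5*5 input, and each computing unit needs the 3*3 window for its output.
W, K = 5, 3

def target_sequence(out_r, out_c):
    return [(out_r + i) * W + (out_c + j) + 1   # +1: numbering starts at 1
            for i in range(K) for j in range(K)]

assert target_sequence(0, 0) == [1, 2, 3, 6, 7, 8, 11, 12, 13]
assert target_sequence(0, 1) == [2, 3, 4, 7, 8, 9, 12, 13, 14]
assert target_sequence(0, 2) == [3, 4, 5, 8, 9, 10, 13, 14, 15]
```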
In a preferred embodiment, the numbered input data are matched, in chronological order and in the form of a data stream, against the data of each computing unit and stored. For example, the first row of the stream (1, 2, 3, 4, 5) is compared against the three computing units in turn; from their target sequences it follows that the three data (1, 2, 3) are stored into the first unit's cache, (2, 3, 4) into the second unit's cache, and (3, 4, 5) into the third unit's cache, and so on until all data have been allocated. Of course, in other implementations the data may be allocated in a different order, which is not restricted here.

It should be noted that in Embodiment 1 the numbering and the allocation of the data can proceed simultaneously: according to the preset configuration parameters, each datum is written directly into the caches of the computing units in the predetermined time order while the original data are being read and the zero-padding data generated, so no additional storage device is needed for intermediate data.
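The single-pass behaviour described here can be pictured as a stream dispatcher: each arriving datum is tagged with its time number and forwarded at once to every cache whose target sequence contains that number, so no intermediate buffer of the whole input is ever built. A hedged Python rendering follows, in which dictionaries and lists stand in for the hardware caches:

```python
def preload_streaming(data_source, target_sequences):
    """Tag each incoming datum with a time number and forward it immediately
    to every computing unit whose target sequence contains that number."""
    caches = {unit: [] for unit in target_sequences}
    wanted = {unit: set(seq) for unit, seq in target_sequences.items()}
    for time_number, datum in enumerate(data_source, start=1):
        for unit, seq in wanted.items():
            if time_number in seq:
                caches[unit].append((time_number, datum))  # shared data copied in one pass
    return caches

# Stand-in input stream: the 25 data of a 5*5 input, arriving in time order.
stream = (f"d{t}" for t in range(1, 26))
seqs = {0: [1, 2, 3, 6, 7, 8, 11, 12, 13],
        1: [2, 3, 4, 7, 8, 9, 12, 13, 14],
        2: [3, 4, 5, 8, 9, 10, 13, 14, 15]}
caches = preload_streaming(stream, seqs)
```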
In a preferred embodiment, each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences. The target sequence of each computing unit comprises a plurality of sequence segments; time numbers within the same segment are consecutive, while the segments themselves are spaced apart. For example, for the three adjacent computing units above, the datum with time number 3 is needed by all three units. With the allocation scheme of the present application, such a datum is stored into the caches of all the computing units in one pass, avoiding the traditional approach in which each computing unit reads it from memory separately, and thereby improving data reuse. For the first computing unit, the target sequence (1, 2, 3, 6, 7, 8, 11, 12, 13) comprises three sequence segments, (1, 2, 3), (6, 7, 8) and (11, 12, 13); the segments are far apart in the time series, and the corresponding memory addresses are likewise far apart. Under the traditional approach, many addresses would have to be traversed from the first address before the required data could be read, costing a long data loading time. By preloading each datum into the cache of each computing unit in advance, the present application reduces the data loading time during computation.

The data preloading method of this embodiment sorts the original data and the zero-padding data in time to form a data stream and distributes each datum of the stream to the caches of different computing units. This improves data reuse, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer computations, and reduces the overall power consumption of the chip. Moreover, since the data can be allocated while they are being ordered, no extra memory is needed to store intermediate data, which lowers cost.
Embodiment 2
As shown in Figure 5, the data preloading method for a convolutional neural network of Embodiment 2 differs from that of Embodiment 1 in step S20; in Embodiment 2, step S20 includes the following steps:

Step S21': splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form an input data set;

Step S22': acquiring the preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by that computing unit when performing convolution computation;

Step S23': storing the data in the input data set whose spatial addresses match the target addresses into the cache corresponding to each computing unit.
Specifically, preset configuration parameters are obtained, including the spatial position of each original datum and each zero-padding datum in the input data set to be formed, and the original data set and the zero-padded data set are spliced in the spatial dimension to form the input data set. The spatial address of each datum of the input data set is matched against each computing unit's target addresses, that is, the addresses of the data that unit needs for its convolution computation, and the matched data are stored into the cache of the corresponding computing unit, so that each computing unit's cache holds all of its required data in advance. As shown in Figure 6, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it the zero-padded data set, spliced in spatial order from (0, 0) to (4, 4), each letter denoting the spatial position of an original or zero-padding datum in the input data set. Assuming a 3*3 convolution kernel, three of the computing units have the following target addresses: (A, B, C, F, G, H, K, L, M) for the first unit, (B, C, D, G, H, I, L, M, N) for the second, and (C, D, E, H, I, J, M, N, O) for the third. The data at the corresponding spatial addresses of the input data set are allocated to the cache of each computing unit. As these target addresses show, the addresses of the data each computing unit needs are not contiguous: if data were fetched from external memory during computation, a sequential search starting from the first address would be required before the matching data could be read, and that loading process takes a long time. In this embodiment the data are stored beforehand by matching spatial addresses, so each computing unit reads its data directly from its cache, greatly reducing the overall computation time.
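Under the layout assumed in Figure 6 (letters A to Y placed row-major in the 5*5 input data set), the address matching of steps S21' to S23' can be sketched as follows; the helper names and placeholder values are our own, not part of the described apparatus:

```python
import string

# Spliced 5*5 input data set, addressed row-major by the letters A..Y.
letters = string.ascii_uppercase[:25]
input_set = {addr: f"value_at_{addr}" for addr in letters}

def target_addresses(top_left, kernel=3, width=5):
    """Spatial addresses of the kernel window whose top-left address is
    `top_left`; sketches how the preset target addresses of Figure 6 arise."""
    base = letters.index(top_left)
    return [letters[base + r * width + c] for r in range(kernel) for c in range(kernel)]

units = {i: target_addresses(tl) for i, tl in enumerate("ABC")}
# units[0] == ['A','B','C','F','G','H','K','L','M'], matching the first unit above.

# Step S23': store the data whose spatial addresses match each unit's targets.
caches = {i: {addr: input_set[addr] for addr in addrs} for i, addrs in units.items()}
```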
In a preferred embodiment, the spliced input data set is stored in a memory and is then matched, in spatial order and in the form of a data stream, against the data of each computing unit and stored. For example, the first row of data (A, B, C, D, E) is compared against the three computing units in turn; from their target addresses it follows that the three data (A, B, C) are stored into the first unit's cache, (B, C, D) into the second unit's cache, and (C, D, E) into the third unit's cache, and so on until all data have been allocated. Of course, in other implementations the data may be allocated in a different order, which is not restricted here.

In a preferred embodiment, each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses. The target addresses of each computing unit comprise a plurality of address segments; addresses within the same segment are consecutive, while the segments themselves are spaced apart. For example, for the three adjacent computing units above, the datum at address C is needed by all three units. With the allocation scheme of the present application, such a datum is stored into the caches of all the computing units in one pass, avoiding the traditional approach in which each computing unit reads it from memory separately, and thereby improving data reuse. For the first computing unit, the target addresses (A, B, C, F, G, H, K, L, M) comprise three address segments, (A, B, C), (F, G, H) and (K, L, M); the data of different segments are far apart in the spatial dimension, and the corresponding memory addresses are likewise far apart. Under the traditional approach, many addresses would have to be traversed from the first address before the required data could be read, costing a long data loading time. By preloading each datum into the cache of each computing unit in advance, the present application reduces the data loading time during computation.

The data preloading method of this embodiment splices and stores the original data and the zero-padding data in spatial order to form an input data set and distributes each datum of the input data set to the caches of different computing units. This improves data reuse, resolves the low data loading efficiency caused by the discontinuous addresses of the data the computing units fetch, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer computations, and reduces the overall power consumption of the chip.
Embodiment 3
As shown in Figure 7, the data preloading apparatus for a convolutional neural network of Embodiment 3 includes a data acquisition module 100' and a data distribution module 200'. The data acquisition module 100' is configured to acquire an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network. The data distribution module 200' is configured to store the data of the input data set into the caches corresponding to the respective computing units according to a predetermined distribution scheme before the computing units perform convolution computation, so as to form a different target data set in each computing unit's cache, the data in a target data set being the data required by that computing unit during convolution. As shown in Figure 8, the original data are stored in memory, transferred to the data preloading apparatus, and distributed, together with the zero-padding data generated by the apparatus, to the caches of the different computing units.

Further, the data preloading apparatus also includes a configuration decoder 300' configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated; the memory address information includes the address length for memory access, and the spatial information includes the length of the zero-padding data.

Further, the data acquisition module 100' includes a memory controller 101' and a zero-padding generator 102'. The memory controller 101' reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102' generates the zero-padded data set according to the spatial information.

Further, the configuration decoder 300' is configured to receive a configuration file and to generate from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation.

The data distribution module 200' includes a time encoding unit 201', a sequence acquisition unit 202' and a data distribution unit 203'. The time encoding unit 201' sets time numbers for each datum of the original data set and each datum of the zero-padded data set according to the preset time order; the sequence acquisition unit 202' acquires the target sequence of each computing unit; and the data distribution unit 203' stores the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit. The data processing flow of the data distribution module 200' follows the description in Embodiment 1 and is not repeated here.
Embodiment 4
As shown in Figure 9, the data preloading apparatus for a convolutional neural network of Embodiment 4 includes a data acquisition module 100 and a data distribution module 200. The data acquisition module 100 is configured to acquire an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network. The data distribution module 200 is configured to store the data of the input data set into the caches corresponding to the respective computing units according to a predetermined distribution scheme before the computing units perform convolution computation, so as to form a different target data set in each computing unit's cache, the data in a target data set being the data required by that computing unit during convolution.

Specifically, the data preloading apparatus also includes a configuration decoder 300 configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated; the memory address information includes the address length for memory access, and the spatial information includes the length of the zero-padding data.

The data acquisition module 100 includes a memory controller 101 and a zero-padding generator 102. The memory controller 101 reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102 generates the zero-padded data set according to the spatial information.

Further, the configuration decoder 300 is also configured to receive a configuration file and to generate from it the preset spatial order and the target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation.

The data distribution module 200 includes a data splicing unit 201, an address reading unit 202, a data distribution unit 203 and a data storage unit 204. The data splicing unit 201 splices each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set; the data storage unit 204 stores the input data set; the address reading unit 202 reads the target addresses of each computing unit; and the data distribution unit 203 stores the data in the input data set whose spatial addresses match the target addresses into the cache corresponding to each computing unit. The data processing flow of the data distribution module 200 follows the description in Embodiment 2 and is not repeated here.
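As a reading aid only, the division of labour among the units of Embodiment 4 can be rendered schematically in Python; the class and attribute names are hypothetical stand-ins for hardware blocks, and the configuration format is assumed rather than specified by this application:

```python
class ConfigDecoder:
    """Decodes the configuration file into memory address information,
    zero-padding information, and per-unit target addresses (element 300)."""
    def __init__(self, config):
        self.memory_addresses = config["memory_addresses"]  # where to read originals
        self.pad_addresses = config["pad_addresses"]        # where zeros are placed
        self.target_addresses = config["target_addresses"]  # per computing unit

class DataAcquisitionModule:
    """Memory controller (101) plus zero-padding generator (102)."""
    def __init__(self, memory, decoder):
        self.original = {a: memory[a] for a in decoder.memory_addresses}
        self.zeros = {a: 0 for a in decoder.pad_addresses}

class DataDistributionModule:
    """Splicing unit (201), storage unit (204), address reading unit (202),
    and distribution unit (203)."""
    def __init__(self, acquisition, decoder):
        self.input_set = {**acquisition.original, **acquisition.zeros}  # splice + store
        self.targets = decoder.target_addresses                         # address reading
    def distribute(self):
        # Store the data whose spatial addresses match each unit's targets.
        return {unit: {a: self.input_set[a] for a in addrs}
                for unit, addrs in self.targets.items()}

# Toy usage with an assumed configuration:
memory = {"a0": 7, "a1": 9}
cfg = {"memory_addresses": ["a0", "a1"], "pad_addresses": ["p0"],
       "target_addresses": {0: ["a0", "p0"], 1: ["a1", "p0"]}}
decoder = ConfigDecoder(cfg)
caches = DataDistributionModule(DataAcquisitionModule(memory, decoder), decoder).distribute()
```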
To verify the technical effect of the data preloading method of Embodiment 1, a simulation experiment was performed. The design was implemented in the Verilog HDL language, and the Modelsim simulation tool was used to verify its feasibility and running time. The experimental procedure was, in outline: prepare a configuration file for a specific neural network, write image data into memory, and then issue a start signal, after which the simulation runs automatically; when the experiment completes, the results recorded in memory are inspected through Modelsim.

The experiment proceeded as follows: 64 computing units were used, the input image size was 28*28 with 64 channels, the output image size was 28*28 with 128 channels, and the convolution kernel size was 5*5; the convolution computation started once the chip finished reading the configuration file. Experiment 1 distributed data using the data preloading algorithm of Embodiment 1: the total time to distribute the data from memory to the caches was 0.00504 ms, and the computation time of the computing units was 0.00403 ms. Experiment 2 used the traditional method of reading from memory while computing: the total time to read all the data from memory was 0.05814 ms, with the same computation time of 0.00403 ms. The results show that the data loading time with the preloading algorithm of Embodiment 1 is about an order of magnitude lower than that of the prior art (0.05814 / 0.00504 ≈ 11.5×), greatly improving overall efficiency.
The present application also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when executed by a processor, the program implements the data preloading method for a convolutional neural network of Embodiment 1 or Embodiment 2.
The present application also discloses a computer device. At the hardware level, as shown in Figure 10, the terminal includes a processor 12, an internal bus 13, a network interface 14 and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing apparatus at the logical level. Of course, besides software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow below is not limited to logic units and may also be hardware or logic devices. The computer-readable storage medium 11 stores a data preloading program for a convolutional neural network; when executed by the processor, the program implements the data preloading method for a convolutional neural network described above.

Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principles and spirit of the present invention as defined by the claims and their equivalents, and that such modifications and refinements also fall within the protection scope of the present invention.

Claims (18)

1. A data preloading method for a convolutional neural network, wherein the data preloading method comprises:
    acquiring an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting an input data set of the convolutional neural network;
    before each computing unit performs convolution computation, storing the data of the input data set into caches corresponding to the respective computing units according to a predetermined distribution scheme, so as to form different target data sets in the caches of the respective computing units, wherein the data in a target data set are the data required by the computing unit during the convolution computation.
2. The data preloading method for a convolutional neural network according to claim 1, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
    acquiring preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
3. The data preloading method for a convolutional neural network according to claim 2, wherein each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses.
4. The data preloading method for a convolutional neural network according to claim 3, wherein the target addresses of each computing unit comprise a plurality of address segments, wherein addresses within a same address segment are arranged consecutively and the address segments are spaced apart from one another.
5. The data preloading method for a convolutional neural network according to claim 1, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    setting a time number for each datum of the original data set and each datum of the zero-padded data set according to a preset time order;
    acquiring a preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
6. The data preloading method for a convolutional neural network according to claim 5, wherein each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences.
7. The data preloading method for a convolutional neural network according to claim 6, wherein the target sequence of each computing unit comprises a plurality of sequence segments, wherein time numbers within a same sequence segment are arranged consecutively and the sequence segments are spaced apart from one another.
8. A data preloading apparatus for a convolutional neural network, wherein the data preloading apparatus comprises:
    a data acquisition module configured to acquire an original data set and a zero-padded data set, wherein the original data set and the zero-padded data set together constitute an input data set of the convolutional neural network;
    a data distribution module configured to, before each computing unit performs convolution computation, store the data of the input data set into caches corresponding to the respective computing units according to a predetermined distribution scheme, so as to form different target data sets in the caches of the respective computing units, wherein the data in a target data set are the data required by the computing unit during the convolution computation.
9. The data preloading apparatus for a convolutional neural network according to claim 8, wherein the data preloading apparatus further comprises a configuration decoder configured to receive a configuration file and to generate, from the received configuration file, a preset spatial order and target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    the data distribution module comprising:
    a data splicing unit configured to splice each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;
    a data storage unit configured to store the input data set;
    an address reading unit configured to read the target addresses of each computing unit;
    a data distribution unit configured to store the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
10. The data preloading apparatus for a convolutional neural network according to claim 8, wherein the data preloading apparatus further comprises a configuration decoder configured to receive a configuration file and to generate, from the received configuration file, a preset time order and a target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    the data distribution module comprising:
    a time encoding unit configured to set time numbers for each datum of the original data set and each datum of the zero-padded data set according to the preset time order;
    a sequence acquisition unit configured to acquire the target sequence of each computing unit;
    a data distribution unit configured to store the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
11. The data preloading apparatus for a convolutional neural network according to claim 9, wherein the configuration decoder is further configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated;
    the data acquisition module comprising:
    a memory controller configured to read data from memory according to the memory address information to form the original data set;
    a zero-padding generator configured to generate the zero-padded data set according to the spatial information.
12. A computer-readable storage medium, wherein the computer-readable storage medium stores a data preloading program for a convolutional neural network, and the data preloading program for a convolutional neural network, when executed by a processor, implements the data preloading method for a convolutional neural network according to claim 1.
13. The computer-readable storage medium according to claim 12, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
    acquiring preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
14. The computer-readable storage medium according to claim 13, wherein each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses.
15. The computer-readable storage medium according to claim 14, wherein the target addresses of each computing unit comprise a plurality of address segments, wherein addresses within a same address segment are arranged consecutively and the address segments are spaced apart from one another.
16. The computer-readable storage medium according to claim 12, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    setting a time number for each datum of the original data set and each datum of the zero-padded data set according to a preset time order;
    acquiring a preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
17. The computer-readable storage medium according to claim 16, wherein each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences.
18. The computer-readable storage medium according to claim 17, wherein the target sequence of each computing unit comprises a plurality of sequence segments, wherein time numbers within a same sequence segment are arranged consecutively and the sequence segments are spaced apart from one another.
PCT/CN2020/106761 2020-07-29 2020-08-04 Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium WO2022021459A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010742731.9A CN114090470B (en) 2020-07-29 2020-07-29 Data preloading device and preloading method thereof, storage medium and computer equipment
CN202010742731.9 2020-07-29

Publications (1)

Publication Number Publication Date
WO2022021459A1 true WO2022021459A1 (en) 2022-02-03

Family

ID=80037382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106761 WO2022021459A1 (en) 2020-07-29 2020-08-04 Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114090470B (en)
WO (1) WO2022021459A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109165728A (en) * 2018-08-06 2019-01-08 济南浪潮高新科技投资发展有限公司 A kind of basic computational ele- ment and calculation method of convolutional neural networks
CN109359729A (en) * 2018-09-13 2019-02-19 深思考人工智能机器人科技(北京)有限公司 It is a kind of to realize data cached system and method on FPGA
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN110766150A (en) * 2019-10-15 2020-02-07 北京芯启科技有限公司 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102631381B1 (en) * 2016-11-07 2024-01-31 삼성전자주식회사 Convolutional neural network processing method and apparatus
CN106874219B (en) * 2016-12-23 2018-11-02 深圳云天励飞技术有限公司 A kind of data dispatching method of convolutional neural networks, system and computer equipment
CN107894957B (en) * 2017-11-14 2020-09-01 河南鼎视智能科技有限公司 Convolutional neural network-oriented memory data access and zero insertion method and device
WO2019119301A1 (en) * 2017-12-20 2019-06-27 华为技术有限公司 Method and device for determining feature image in convolutional neural network model
US20210201124A1 (en) * 2018-08-27 2021-07-01 Neuralmagic Inc. Systems and methods for neural network convolutional layer matrix multiplication using cache memory
CN109460813B (en) * 2018-09-10 2022-02-15 中国科学院深圳先进技术研究院 Acceleration method, device and equipment for convolutional neural network calculation and storage medium
CN110163338B (en) * 2019-01-31 2024-02-02 腾讯科技(深圳)有限公司 Chip operation method and device with operation array, terminal and chip

Also Published As

Publication number Publication date
CN114090470A (en) 2022-02-25
CN114090470B (en) 2023-02-17

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20947295; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20947295; Country of ref document: EP; Kind code of ref document: A1)