WO2022021459A1 - Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium - Google Patents

Info

Publication number
WO2022021459A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data set
computing unit
target
address
Application number
PCT/CN2020/106761
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
王卓
Original Assignee
中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022021459A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, for peripheral storage systems, e.g. disk cache
    • G06F 12/0871 Allocation or management of cache space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G06F 9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present invention belongs to the technical field of integrated circuits and, in particular, relates to a data preloading method for a convolutional neural network and a computer-readable storage medium.
  • The convolution operation is a very important deep learning feature extraction method.
  • Current mainstream deep learning neural networks such as LeNet1, AlexNet, VGG-16, and VGG-19 are built by stacking convolutional layers.
  • As the number of network layers increases, the classification accuracy improves.
  • Because the convolution operation itself consumes a great deal of computing power, general-purpose computer platforms cannot keep up in computing power and speed, so a dedicated convolution processing chip must be designed.
  • For a dedicated convolution processing chip, optimizations such as adding computing nodes, enlarging the data cache, and improving data-type conversion can greatly increase the calculation speed of the computing units.
  • However, current dedicated convolution processing chips still read data in the traditional way: when the processor needs data, it first searches the corresponding cache, and if the data is not there, it locates and reads it from external memory. This process takes a long time and delays the overall computation. In particular, when multiple processor units need to read from external memory, they can only access it sequentially, which prolongs data loading.
  • The technical problem solved by the present invention is how to increase the data loading speed so as to speed up the overall operation of the convolutional neural network.
  • A data preloading method for a convolutional neural network, comprising:
  • acquiring an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network;
  • before each computing unit performs the convolution calculation, storing the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • One specific method of storing the data in the input data set into the caches corresponding to the computing units according to the predetermined distribution scheme includes:
  • splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
  • obtaining the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;
  • storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • Each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.
  • The target address of each computing unit includes a plurality of address segments; the addresses within the same address segment are consecutive, while the addresses of different address segments are spaced apart.
  • Another specific method of storing the data in the input data set into the caches corresponding to the computing units according to the predetermined distribution scheme includes:
  • assigning a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;
  • obtaining the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;
  • storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • Each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.
  • The target sequence of each computing unit includes a plurality of sequence segments; the time numbers within the same sequence segment are consecutive, while the time numbers of different sequence segments are spaced apart.
  • The present application also discloses a data preloading apparatus for a convolutional neural network, the apparatus comprising:
  • a data acquisition module for acquiring an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network;
  • a data distribution module for storing, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The data loading apparatus may further include a configuration decoder for receiving a configuration file and generating from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.
  • In this case the data distribution module includes:
  • a data splicing unit for splicing each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;
  • a data storage unit for storing the input data set;
  • an address reading unit for reading the target address of each computing unit;
  • a data allocation unit for storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • Alternatively, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.
  • In this case the data distribution module includes:
  • a time coding unit for assigning a time number to each datum of the original data set and each datum of the zero-padded data set in the preset time order;
  • a sequence acquisition unit for acquiring the target sequence of each computing unit;
  • a data allocation unit for storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The configuration decoder is further configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated.
  • The data acquisition module includes:
  • a memory controller for reading data from memory according to the memory address information to form the original data set;
  • a zero-padding generator for generating the zero-padded data set according to the space information.
  • The present invention also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the above data preloading method for a convolutional neural network is implemented.
  • The invention thus discloses a data preloading method for a convolutional neural network with the following technical effects compared with the traditional calculation method:
  • by sorting the original data and zero-padding data in time or space to form an input data set and allocating each datum in the input data set to the caches of different computing units, data reuse is improved, the number of memory reads is reduced, data preparation time is shortened, the latency between layer calculations is lowered, and the overall power consumption of the chip is reduced.
  • FIG. 1 is a flowchart of the data preloading method for a convolutional neural network according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram of zero-padding the original data according to Embodiment 1 of the present invention.
  • FIG. 3 is a flowchart of data allocation according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram of the input data set according to Embodiment 1 of the present invention.
  • FIG. 5 is a flowchart of data allocation according to Embodiment 2 of the present invention.
  • FIG. 6 is a schematic diagram of the input data set according to Embodiment 2 of the present invention.
  • FIG. 7 is an architecture diagram of the data preloading apparatus according to Embodiment 3 of the present invention.
  • FIG. 8 is a diagram of the connections among the data preloading apparatus, the memory, and the caches according to Embodiment 3 of the present invention.
  • FIG. 9 is an architecture diagram of the data preloading apparatus according to Embodiment 4 of the present invention.
  • FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
  • In the prior art, during calculation the processor reads data from external memory and temporarily stores it in the cache; whenever new data is needed, it must again be read from external memory. Because the processor computes faster than data can be read, it "waits" for data during the calculation, which lengthens the overall operation time.
  • In the present application, the data each computing unit is about to use are stored in the corresponding cache before the computing unit starts calculating; once calculation begins, the unit only needs to read the data from its cache, which greatly improves read speed and reduces the overall computation time.
  • The data preloading method for a convolutional neural network in Embodiment 1 includes the following steps:
  • Step S10: acquire an original data set and a zero-padded data set, the two together constituting the input data set of the convolutional neural network.
  • Step S20: before each computing unit performs the convolution calculation, store the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • To facilitate the calculation of subsequent convolutional layers, the original image data need to be zero-padded to form an image of a specific size.
  • After processing by the first convolutional layer, an output image of a specific size is generated.
  • In step S10, the original data set is read from memory according to configuration parameters and the corresponding zero-padded data set is generated; the configuration parameters include the address length of memory access, the length of the zero-padding data, and the like.
  • In FIG. 2, the right image is the output image, the middle image is the convolution kernel, and the left image is the input image.
  • The output image size is 28*28, there are 64 computing units, and the convolution kernel size is 5*5.
  • The dotted box in the left figure represents the original 28*28 input image, i.e., the original data set.
  • Two rows of zero data are added around the dotted box to form a 32*32 input image, so that convolving the kernel with the 204 data from (0, 0) to (6, 11) of the input image yields the 64 output data from (0, 0) to (2, 7) of the output image.
  • The above is an exemplary description; the position and length of the zero-padding data are determined by actual needs and are not elaborated here.
  • Step S20 includes the following steps:
  • Step S21: assign a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;
  • Step S22: obtain the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;
  • Step S23: store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The original data set and the zero-padded data set are "spliced" in the time dimension to form an input data stream.
  • By matching the time number of each datum in the stream against the target sequences, each computing unit's cache can read in all the required data in advance.
  • As shown in FIG. 4, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set. Time numbers are assigned in order from (0, 0) to (4, 4), each digit denoting a time number.
  • Assuming a 3*3 convolution kernel, for three of the computing units the target sequence of the first is (1, 2, 3, 6, 7, 8, 11, 12, 13), that of the second is (2, 3, 4, 7, 8, 9, 12, 13, 14), and that of the third is (3, 4, 5, 8, 9, 10, 13, 14, 15); in this way the data in the input with matching time numbers are allocated to each computing unit's cache.
  • The data required by each computing unit are discontinuous. If data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching time numbers, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
  • The numbered input data are matched against each computing unit's data and stored as a data stream in chronological order; for example, the first row of the data stream (1, 2, 3, 4, 5) is compared with the three computing units in turn, storing (1, 2, 3) in the first unit's cache, (2, 3, 4) in the second, and (3, 4, 5) in the third, and so on until all data are allocated.
  • In other embodiments the data may be allocated in a different order, which is not limited here.
  • The numbering and allocation of the data can proceed simultaneously: according to the configuration parameters set in advance, as each original datum is read and each zero-padding datum is generated, the data are written directly into the caches of the computing units in the predetermined time order, so no additional storage device is needed for intermediate data.
  • Each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.
  • The target sequence of each computing unit includes a plurality of sequence segments; the time numbers within the same sequence segment are consecutive, while the time numbers of different sequence segments are spaced apart.
  • For the three adjacent computing units above, the datum with time number 1 is required by all three; with the allocation scheme of the present application it is stored into each unit's cache in a single pass, avoiding the traditional need for each computing unit to read it from memory separately and improving data reuse.
  • For the first computing unit, the target sequence (1, 2, 3, 6, 7, 8, 11, 12, 13) comprises three sequence segments, (1, 2, 3), (6, 7, 8), and (11, 12, 13); the segments are far apart in the time series, and the addresses of the corresponding memory locations are likewise far apart. With the traditional method, many addresses must be searched starting from the first address before the required data can be read, which costs considerable loading time.
  • The present application reduces the data loading time during calculation by preloading each datum into each computing unit's cache in advance.
  • The data preloading method of this embodiment sorts the original data and zero-padding data in time to form a data stream and allocates each datum in the stream to the caches of different computing units, thereby improving data reuse, reducing the number of memory reads, shortening data preparation time, lowering the latency between layer calculations, and reducing the overall power consumption of the chip. Moreover, allocation can proceed while sorting, so no additional memory is needed to store intermediate data, which reduces cost.
  • The data preloading method for a convolutional neural network in Embodiment 2 differs from Embodiment 1 in step S20; in Embodiment 2, step S20 includes the following steps:
  • Step S21': splice each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
  • Step S22': obtain the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;
  • Step S23': store the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • The original data set and the zero-padded data set are spliced in the spatial dimension to form the input data set.
  • By matching the spatial address of each datum against the target addresses, each computing unit's cache can read in all the required data in advance. As shown in FIG. 6, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set.
  • Each letter represents the spatial position of an original or zero-padding datum in the input data set. Assuming a 3*3 convolution kernel, for three of the computing units the target address of the first is (A, B, C, F, G, H, K, L, M), that of the second is (B, C, D, G, H, I, L, M, N), and that of the third is (C, D, E, H, I, J, M, N, O); the data at the corresponding spatial addresses in the input data set are thus allocated to each computing unit's cache.
  • The addresses of the data required by each computing unit are discontinuous. If data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching spatial addresses, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
  • The spliced input data set is stored in memory and then matched, in the form of a data stream in spatial order, against the data of each computing unit; for example, the first row of data (A, B, C, D, E) is compared with the three computing units in turn.
  • From the target addresses of the three computing units, the data (A, B, C) are stored in the first unit's cache, (B, C, D) in the second unit's cache, and (C, D, E) in the third unit's cache, and so on until all data are allocated.
  • In other embodiments the data may be allocated in a different order, which is not limited here.
  • Each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.
  • The target address of each computing unit includes a plurality of address segments; the addresses within the same address segment are consecutive, while the addresses of different address segments are spaced apart.
  • The datum at spatial address A is required by all three computing units above; with the allocation scheme of the present application it is stored into each unit's cache in a single pass, avoiding the traditional need for each computing unit to read it from memory separately and improving data reuse.
  • For the first computing unit, the target address (A, B, C, F, G, H, K, L, M) comprises three address segments, (A, B, C), (F, G, H), and (K, L, M); the segments are far apart in the spatial dimension, and the addresses of the corresponding memory locations are likewise far apart. With the traditional method, many addresses must be searched starting from the first address before the required data can be read, which costs considerable loading time.
  • The present application reduces the data loading time during calculation by preloading each datum into each computing unit's cache in advance.
  • The data preloading method of this embodiment splices and stores the original data and zero-padding data in spatial order to form an input data set and allocates each datum in it to the caches of different computing units, thereby improving data reuse; it resolves the low data-loading efficiency caused by the discontinuous data addresses the computing units require, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer calculations, and reduces the overall power consumption of the chip.
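  • As an illustration of this spatial-address matching, the following minimal Python sketch reproduces the 5*5 input and 3*3 kernel example above; the function and variable names are assumptions for illustration, not taken from the patent.

```python
# Minimal sketch of the Embodiment 2 spatial-address allocation.
import string

H = W = 5   # padded input size (5*5, as in the example above)
K = 3       # convolution kernel size (3*3, as in the example above)

# Spatial addresses A, B, C, ... assigned row by row, as in FIG. 6.
addresses = [[string.ascii_uppercase[r * W + c] for c in range(W)]
             for r in range(H)]

def target_addresses(out_r, out_c):
    """Addresses of the K*K window one computing unit needs for one output."""
    return [addresses[out_r + i][out_c + j]
            for i in range(K) for j in range(K)]

units = [target_addresses(0, c) for c in range(3)]
# units[0] == ['A', 'B', 'C', 'F', 'G', 'H', 'K', 'L', 'M'], and so on.

# Preloading: stream the stored input data set once; each unit's cache keeps
# exactly the data whose spatial address matches one of its target addresses.
stream = [a for row in addresses for a in row]
caches = [[d for d in stream if d in set(u)] for u in units]
assert caches[0] == ['A', 'B', 'C', 'F', 'G', 'H', 'K', 'L', 'M']
```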
  • The data preloading apparatus for a convolutional neural network includes a data acquisition module 100' and a data distribution module 200'. The data acquisition module 100' is used to acquire the original data set and the zero-padded data set, which together constitute the input data set of the convolutional neural network.
  • The data distribution module 200' is configured to store, before each computing unit performs the convolution calculation, the data in the input data set into the buffer corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's buffer, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The original data are stored in memory, transmitted to the data preloading apparatus, and allocated to the buffers of the different computing units together with the zero-padding data generated by the apparatus.
  • The data preloading apparatus further includes a configuration decoder 300' configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated; the memory address information includes the address length of memory access, and the space information includes the length of the zero-padding data.
  • The data acquisition module 100' includes a memory controller 101' and a zero-padding generator 102'; the memory controller 101' reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102' generates the zero-padded data set according to the space information.
  • The configuration decoder 300' is used to receive a configuration file and generate from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.
  • The data distribution module 200' includes a time coding unit 201', a sequence acquisition unit 202', and a data allocation unit 203'.
  • The time coding unit 201' is used to assign time numbers to each datum of the original data set and each datum of the zero-padded data set in the preset time order.
  • The sequence acquisition unit 202' is used to acquire the target sequence of each computing unit.
  • The data allocation unit 203' is configured to store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.
  • The data processing of the data distribution module 200' is as described in Embodiment 1 and is not repeated here.
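  • To make the data path of this embodiment easier to follow, the rough Python model below shows the configuration decoder feeding the memory controller and the zero-padding generator; the configuration fields and function layout are assumptions for illustration, not the patent's actual interfaces.

```python
# Hypothetical functional model of the Embodiment 3 acquisition path; the
# configuration fields below are assumptions, not the patent's format.
from dataclasses import dataclass

@dataclass
class Config:
    mem_base: int   # memory address information: start address of the reads
    mem_len: int    # memory address information: address length to access
    width: int      # original image width (a square image is assumed)
    pad: int        # space information: zero-padding width per edge

def decode(cfg: Config):
    """Configuration decoder: derive read addresses and padded geometry."""
    read_addrs = range(cfg.mem_base, cfg.mem_base + cfg.mem_len)
    padded_w = cfg.width + 2 * cfg.pad
    return read_addrs, padded_w

def acquire(memory, cfg: Config):
    """Data acquisition module: memory controller plus zero-padding generator."""
    read_addrs, padded_w = decode(cfg)
    original = [memory[a] for a in read_addrs]      # memory controller reads
    n_zeros = padded_w * padded_w - len(original)   # from the space information
    return original, [0] * n_zeros                  # zero-padding generator

# Example: a 28*28 image padded by 2 per edge gives a 32*32 input data set.
mem = list(range(10_000))
orig, zeros = acquire(mem, Config(mem_base=0, mem_len=28 * 28, width=28, pad=2))
assert len(orig) + len(zeros) == 32 * 32
```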
  • The data preloading apparatus for a convolutional neural network includes a data acquisition module 100 and a data distribution module 200. The data acquisition module 100 is used to acquire the original data set and the zero-padded data set, which together constitute the input data set of the convolutional neural network.
  • The data distribution module 200 is configured to store, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.
  • The data preloading apparatus further includes a configuration decoder 300 configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated; the memory address information includes the address length of memory access, and the space information includes the length of the zero-padding data.
  • The data acquisition module 100 includes a memory controller 101 and a zero-padding generator 102; the memory controller 101 reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102 generates the zero-padded data set according to the space information.
  • The configuration decoder 300 is further configured to receive a configuration file and generate from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.
  • The data distribution module 200 includes a data splicing unit 201, an address reading unit 202, a data allocation unit 203, and a data storage unit 204. The data splicing unit 201 splices each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set; the data storage unit 204 stores the input data set; the address reading unit 202 reads the target address of each computing unit; and the data allocation unit 203 stores the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.
  • The data processing of the data distribution module 200 is as described in Embodiment 2 and is not repeated here.
  • To verify the effect of the scheme, a simulation experiment was carried out: the design was implemented in Verilog HDL, and its feasibility and running time were simulated and verified with the ModelSim simulation tool.
  • The experimental procedure is mainly to write a configuration file for a specific neural network, write the image data into memory, and then give a start signal, after which the simulation runs automatically.
  • The experiment uses 64 computing units; the input image size is 28*28 with 64 channels, the output image size is 28*28 with 128 channels, and the convolution kernel size is 5*5.
  • Loading the configuration file starts the convolution calculation.
  • Experiment 1 allocates data with the data preloading algorithm of Embodiment 1: the total time to distribute the data from memory to the caches is 0.00504 milliseconds, and the computing units' calculation time is 0.00403 milliseconds.
  • Experiment 2 adopts the traditional calculation method, reading data from memory while calculating: the total time to read all the data from memory is 0.05814 milliseconds, and the calculation time is likewise 0.00403 milliseconds. The experimental results show that the data-reading time with the preloading algorithm of Embodiment 1 is an order of magnitude lower than that of the prior art, greatly improving operation efficiency.
  • The present application also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the data preloading method for a convolutional neural network of Embodiment 1 or Embodiment 2 is implemented.
  • The present application also discloses a computer device.
  • The terminal includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11.
  • The processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing apparatus at the logical level.
  • One or more embodiments of this specification do not exclude other implementations, such as logic devices or combinations of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
  • The computer-readable storage medium 11 stores a data preloading program for the convolutional neural network; when the program is executed by the processor, the above data preloading method for the convolutional neural network is implemented.
  • Computer-readable storage media include persistent and non-persistent, removable and non-removable media, and information storage can be implemented by any method or technology.
  • Information may be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.

Abstract

Disclosed are a data pre-loading apparatus and a data pre-loading method, and a storage medium. The data pre-loading method comprises: acquiring an original data set and a zero-padded data set, wherein the original data set and the zero-padded data set jointly form an input data set of a convolutional neural network; and before calculation units perform convolution calculation, storing, according to a predetermined allocation mode, data in the input data set in caches corresponding to the calculation units, so as to respectively form different target data sets in the caches of the calculation units, wherein data in the target data sets is data required by the calculation units during a convolution calculation process. Temporal sorting or spatial sorting is performed on original data and zero-padded data, an input data set is formed, and pieces of data in the input data set are allocated to caches of different calculation units, so that the data reusability is improved, the memory reading frequency is reduced, the data preparation time is shortened, the delay between layer calculations is reduced, and the overall power consumption of a chip is reduced.

Description

Data preloading apparatus and preloading method thereof, and computer-readable storage medium

Technical Field

The present invention belongs to the technical field of integrated circuits and, in particular, relates to a data preloading method for a convolutional neural network and a computer-readable storage medium.

Background Art

In recent years, owing to the popularity of big data applications and advances in computer hardware, deep learning has been used for feature extraction, classification, and recursive operations on data, with wide application in computer vision, natural language processing, and intelligent system decision-making. The convolution operation is a very important deep learning feature extraction method; current mainstream deep learning neural networks such as LeNet1, AlexNet, VGG-16, and VGG-19 are built by stacking convolutional layers. As the number of network layers increases, classification accuracy improves. However, because the convolution operation itself consumes a great deal of computing power and general-purpose computer platforms cannot keep up in computing power and speed, a dedicated convolution processing chip must be designed.

For a dedicated convolution processing chip, optimizations such as adding computing nodes, enlarging the data cache, and improving data-type conversion can greatly increase the calculation speed of the computing units. However, current dedicated convolution processing chips still read data in the traditional way: when the processor needs data, it first searches the corresponding cache; if the data is not there, it locates and reads it from external memory. This process takes a long time and delays the overall computation. In particular, when multiple processor units need to read data from external memory, they can only access the memory sequentially, which prolongs data loading.

Therefore, on top of increasing the operation speed of the computing units, how to increase the data loading speed is a technical problem urgently needing a solution in the art.
Summary of the Invention

(1) Technical problem to be solved by the present invention

The technical problem solved by the present invention is how to increase the data loading speed so as to speed up the overall operation of the convolutional neural network.

(2) Technical solution adopted by the present invention
A data preloading method for a convolutional neural network, the method comprising:

acquiring an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting the input data set of the convolutional neural network;

before each computing unit performs the convolution calculation, storing the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Optionally, the specific method of storing the data in the input data set into the cache corresponding to each computing unit according to the predetermined distribution scheme includes:

splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;

obtaining the preset target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation;

storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.

Optionally, each computing unit has multiple target addresses, and at least two computing units share some of the same target addresses.

Optionally, the target address of each computing unit includes a plurality of address segments, where the addresses within the same address segment are consecutive and the addresses of different address segments are spaced apart.

Optionally, the specific method of storing the data in the input data set into the cache corresponding to each computing unit according to the predetermined distribution scheme includes:

assigning a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;

obtaining the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;

storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Optionally, each computing unit has multiple target sequence entries, and at least two computing units share part of the same target sequence.

Optionally, the target sequence of each computing unit includes a plurality of sequence segments, where the time numbers within the same sequence segment are consecutive and the time numbers of different sequence segments are spaced apart.
The present application also discloses a data preloading apparatus for a convolutional neural network, the apparatus comprising:

a data acquisition module for acquiring an original data set and a zero-padded data set, the two together constituting the input data set of the convolutional neural network;

a data distribution module for storing, before each computing unit performs the convolution calculation, the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Optionally, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset spatial order and the target address of each computing unit, the target address being the spatial address corresponding to the data each computing unit requires when performing the convolution calculation.

The data distribution module includes:

a data splicing unit for splicing each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;

a data storage unit for storing the input data set;

an address reading unit for reading the target address of each computing unit;

a data allocation unit for storing the data in the input data set whose spatial address matches the target address into the cache corresponding to each computing unit.

Alternatively, the data loading apparatus further includes a configuration decoder for receiving a configuration file and generating from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation.

The data distribution module includes:

a time coding unit for assigning a time number to each datum of the original data set and each datum of the zero-padded data set in the preset time order;

a sequence acquisition unit for acquiring the target sequence of each computing unit;

a data allocation unit for storing the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Optionally, the configuration decoder is further configured to generate, from the configuration file, memory address information and space information of the zero-padding data to be generated.

The data acquisition module includes:

a memory controller for reading data from memory according to the memory address information to form the original data set;

a zero-padding generator for generating the zero-padded data set according to the space information.
The present invention also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when the program is executed by a processor, the above data preloading method for a convolutional neural network is implemented.

(3) Beneficial effects

The present invention discloses a data preloading method for a convolutional neural network which, compared with the traditional calculation method, has the following technical effects:

(1) By sorting the original data and zero-padding data in time or space to form an input data set and allocating each datum in the input data set to the caches of different computing units, data reuse is improved, the number of memory reads is reduced, data preparation time is shortened, the latency between layer calculations is lowered, and the overall power consumption of the chip is reduced.
Brief Description of the Drawings

FIG. 1 is a flowchart of the data preloading method for a convolutional neural network according to Embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of zero-padding the original data according to Embodiment 1 of the present invention;

FIG. 3 is a flowchart of data allocation according to Embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of the input data set according to Embodiment 1 of the present invention;

FIG. 5 is a flowchart of data allocation according to Embodiment 2 of the present invention;

FIG. 6 is a schematic diagram of the input data set according to Embodiment 2 of the present invention;

FIG. 7 is an architecture diagram of the data preloading apparatus according to Embodiment 3 of the present invention;

FIG. 8 is a diagram of the connections among the data preloading apparatus, the memory, and the caches according to Embodiment 3 of the present invention;

FIG. 9 is an architecture diagram of the data preloading apparatus according to Embodiment 4 of the present invention;

FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

Before describing the embodiments of the present application in detail, the inventive concept is briefly described. In the prior art, during calculation the processor reads data from external memory and temporarily stores it in the cache; whenever new data is needed for the next calculation, it must again be read from external memory. Because the processor computes faster than data can be read, the processor "waits" for data during the calculation, which lengthens the overall operation time. The present application stores the data each computing unit is about to use in the corresponding cache before the computing unit starts calculating; once calculation begins, the unit only needs to read the corresponding data from its cache, which greatly improves read speed and reduces the overall computation time.
Embodiment 1

Specifically, as shown in FIG. 1, the data preloading method for a convolutional neural network of Embodiment 1 includes the following steps:

Step S10: acquire an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting the input data set of the convolutional neural network.

Step S20: before each computing unit performs the convolution calculation, store the data in the input data set into the cache corresponding to each computing unit according to a predetermined distribution scheme, so as to form a different target data set in each computing unit's cache, where the data in the target data set are the data the computing unit requires during the convolution calculation.

Specifically, when images of different sizes are input into a convolutional neural network, output images of different sizes are obtained after the convolution calculation. To facilitate the calculation of subsequent convolutional layers, the original image data need to be zero-padded to form an image of a specific size; after processing by the first convolutional layer, an output image of a specific size is generated. Further, in step S10, the original data set is read from memory according to configuration parameters and the corresponding zero-padded data set is generated; the configuration parameters include the address length of memory access, the length of the zero-padding data, and the like. For example, as shown in FIG. 2, the right image is the output image, the middle image is the convolution kernel, and the left image is the input image. The output image size is 28*28, there are 64 computing units with 64 output data from (0, 0) to (2, 7), and the convolution kernel size is 5*5. The dotted box in the left figure represents the original 28*28 input image, i.e., the original data set; two rows of zero data are added around the dotted box to form a 32*32 input image, so that convolving the kernel with the 204 data from (0, 0) to (6, 11) of the input image yields the 64 output data from (0, 0) to (2, 7) of the output image. Of course, this is an exemplary description; the position and length of the zero-padding data are determined by actual needs and are not elaborated here.
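As a concrete sketch of this padding step, the short numpy snippet below reproduces the 28*28 to 32*32 example; the variable names are illustrative and not part of the patent.

```python
# Sketch of step S10 for the example above: pad the 28*28 original image
# with two rings of zeros to form the 32*32 input of a 5*5 convolution.
import numpy as np

original = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)  # stand-in data
pad = (5 - 1) // 2   # a 5*5 kernel keeping a 28*28 output needs 2 zeros per edge
padded = np.pad(original, pad, mode="constant", constant_values=0)
assert padded.shape == (32, 32)
```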
Further, as shown in FIG. 3, step S20 includes the following steps:

Step S21: assign a time number to each datum of the original data set and each datum of the zero-padded data set in a preset time order;

Step S22: obtain the preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data each computing unit requires when performing the convolution calculation;

Step S23: store the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit.

Specifically, using configuration parameters set in advance, including the chronological order of each original datum and each zero-padding datum, the original data set and the zero-padded data set are "spliced" in the time dimension to form an input data stream. The time number of each datum in the input data stream is matched against the time numbers of the data each computing unit requires for the convolution calculation, i.e., the target sequence, and the matching data are stored in the cache corresponding to each computing unit, so that each computing unit's cache holds all the required data in advance.

For example, as shown in FIG. 4, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it represents the zero-padded data set. Time numbers are assigned in order from (0, 0) to (4, 4), each digit denoting a time number. Assuming a 3*3 convolution kernel, for three of the computing units the target sequence of the first is (1, 2, 3, 6, 7, 8, 11, 12, 13), that of the second is (2, 3, 4, 7, 8, 9, 12, 13, 14), and that of the third is (3, 4, 5, 8, 9, 10, 13, 14, 15); in this way the data in the input with matching time numbers are allocated to each computing unit's cache. As these target sequences show, the data each computing unit requires are discontinuous; if data were read from external memory during calculation, a sequential search from the first address would be needed before the matching data could be read, and this loading process would take a long time. In this embodiment, data are stored in advance by matching time numbers, so the computing unit can read data directly from the cache, which greatly reduces the overall computation time.
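The target sequences in this example follow directly from the output position and the kernel size. A minimal Python sketch, assuming row-major time numbering starting at 1 as in FIG. 4 (the names are illustrative, not the patent's):

```python
# Sketch of the target sequences above: time numbers run row by row over the
# 5*5 input, and each computing unit needs the 3*3 window for its output.
W, K = 5, 3

def target_sequence(out_r, out_c):
    return [(out_r + i) * W + (out_c + j) + 1   # +1: numbering starts at 1
            for i in range(K) for j in range(K)]

assert target_sequence(0, 0) == [1, 2, 3, 6, 7, 8, 11, 12, 13]
assert target_sequence(0, 1) == [2, 3, 4, 7, 8, 9, 12, 13, 14]
assert target_sequence(0, 2) == [3, 4, 5, 8, 9, 10, 13, 14, 15]
```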
In a preferred embodiment, the numbered input data are matched, in chronological order and in the form of a data stream, against the data of each computing unit and stored. For example, the first row of the stream (1, 2, 3, 4, 5) is compared against the three computing units in turn; from their target sequences it follows that the three data (1, 2, 3) are stored into the first unit's cache, (2, 3, 4) into the second unit's cache, and (3, 4, 5) into the third unit's cache, and so on until all data have been allocated. Of course, in other implementations the data may be allocated in a different order, which is not restricted here.

It should be noted that in Embodiment 1 the numbering and the allocation of the data can proceed simultaneously: according to the preset configuration parameters, each datum is written directly into the caches of the computing units in the predetermined time order while the original data are being read and the zero-padding data generated, so no additional storage device is needed for intermediate data.
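The single-pass behaviour described here can be pictured as a stream dispatcher: each arriving datum is tagged with its time number and forwarded at once to every cache whose target sequence contains that number, so no intermediate buffer of the whole input is ever built. A hedged Python rendering follows, in which dictionaries and lists stand in for the hardware caches:

```python
def preload_streaming(data_source, target_sequences):
    """Tag each incoming datum with a time number and forward it immediately
    to every computing unit whose target sequence contains that number."""
    caches = {unit: [] for unit in target_sequences}
    wanted = {unit: set(seq) for unit, seq in target_sequences.items()}
    for time_number, datum in enumerate(data_source, start=1):
        for unit, seq in wanted.items():
            if time_number in seq:
                caches[unit].append((time_number, datum))  # shared data copied in one pass
    return caches

# Stand-in input stream: the 25 data of a 5*5 input, arriving in time order.
stream = (f"d{t}" for t in range(1, 26))
seqs = {0: [1, 2, 3, 6, 7, 8, 11, 12, 13],
        1: [2, 3, 4, 7, 8, 9, 12, 13, 14],
        2: [3, 4, 5, 8, 9, 10, 13, 14, 15]}
caches = preload_streaming(stream, seqs)
```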
In a preferred embodiment, each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences. The target sequence of each computing unit comprises a plurality of sequence segments; time numbers within the same segment are consecutive, while the segments themselves are spaced apart. For example, for the three adjacent computing units above, the datum with time number 3 is needed by all three units. With the allocation scheme of the present application, such a datum is stored into the caches of all the computing units in one pass, avoiding the traditional approach in which each computing unit reads it from memory separately, and thereby improving data reuse. For the first computing unit, the target sequence (1, 2, 3, 6, 7, 8, 11, 12, 13) comprises three sequence segments, (1, 2, 3), (6, 7, 8) and (11, 12, 13); the segments are far apart in the time series, and the corresponding memory addresses are likewise far apart. Under the traditional approach, many addresses would have to be traversed from the first address before the required data could be read, costing a long data loading time. By preloading each datum into the cache of each computing unit in advance, the present application reduces the data loading time during computation.

The data preloading method of this embodiment sorts the original data and the zero-padding data in time to form a data stream and distributes each datum of the stream to the caches of different computing units. This improves data reuse, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer computations, and reduces the overall power consumption of the chip. Moreover, since the data can be allocated while they are being ordered, no extra memory is needed to store intermediate data, which lowers cost.
Embodiment 2
As shown in Figure 5, the data preloading method for a convolutional neural network of Embodiment 2 differs from that of Embodiment 1 in step S20; in Embodiment 2, step S20 includes the following steps:

Step S21': splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form an input data set;

Step S22': acquiring the preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by that computing unit when performing convolution computation;

Step S23': storing the data in the input data set whose spatial addresses match the target addresses into the cache corresponding to each computing unit.
Specifically, preset configuration parameters are obtained, including the spatial position of each original datum and each zero-padding datum in the input data set to be formed, and the original data set and the zero-padded data set are spliced in the spatial dimension to form the input data set. The spatial address of each datum of the input data set is matched against each computing unit's target addresses, that is, the addresses of the data that unit needs for its convolution computation, and the matched data are stored into the cache of the corresponding computing unit, so that each computing unit's cache holds all of its required data in advance. As shown in Figure 6, assume the input data size is 5*5; the area inside the dashed box represents the original data set and the area outside it the zero-padded data set, spliced in spatial order from (0, 0) to (4, 4), each letter denoting the spatial position of an original or zero-padding datum in the input data set. Assuming a 3*3 convolution kernel, three of the computing units have the following target addresses: (A, B, C, F, G, H, K, L, M) for the first unit, (B, C, D, G, H, I, L, M, N) for the second, and (C, D, E, H, I, J, M, N, O) for the third. The data at the corresponding spatial addresses of the input data set are allocated to the cache of each computing unit. As these target addresses show, the addresses of the data each computing unit needs are not contiguous: if data were fetched from external memory during computation, a sequential search starting from the first address would be required before the matching data could be read, and that loading process takes a long time. In this embodiment the data are stored beforehand by matching spatial addresses, so each computing unit reads its data directly from its cache, greatly reducing the overall computation time.
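Under the layout assumed in Figure 6 (letters A to Y placed row-major in the 5*5 input data set), the address matching of steps S21' to S23' can be sketched as follows; the helper names and placeholder values are our own, not part of the described apparatus:

```python
import string

# Spliced 5*5 input data set, addressed row-major by the letters A..Y.
letters = string.ascii_uppercase[:25]
input_set = {addr: f"value_at_{addr}" for addr in letters}

def target_addresses(top_left, kernel=3, width=5):
    """Spatial addresses of the kernel window whose top-left address is
    `top_left`; sketches how the preset target addresses of Figure 6 arise."""
    base = letters.index(top_left)
    return [letters[base + r * width + c] for r in range(kernel) for c in range(kernel)]

units = {i: target_addresses(tl) for i, tl in enumerate("ABC")}
# units[0] == ['A','B','C','F','G','H','K','L','M'], matching the first unit above.

# Step S23': store the data whose spatial addresses match each unit's targets.
caches = {i: {addr: input_set[addr] for addr in addrs} for i, addrs in units.items()}
```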
In a preferred embodiment, the spliced input data set is stored in a memory and is then matched, in spatial order and in the form of a data stream, against the data of each computing unit and stored. For example, the first row of data (A, B, C, D, E) is compared against the three computing units in turn; from their target addresses it follows that the three data (A, B, C) are stored into the first unit's cache, (B, C, D) into the second unit's cache, and (C, D, E) into the third unit's cache, and so on until all data have been allocated. Of course, in other implementations the data may be allocated in a different order, which is not restricted here.

In a preferred embodiment, each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses. The target addresses of each computing unit comprise a plurality of address segments; addresses within the same segment are consecutive, while the segments themselves are spaced apart. For example, for the three adjacent computing units above, the datum at address C is needed by all three units. With the allocation scheme of the present application, such a datum is stored into the caches of all the computing units in one pass, avoiding the traditional approach in which each computing unit reads it from memory separately, and thereby improving data reuse. For the first computing unit, the target addresses (A, B, C, F, G, H, K, L, M) comprise three address segments, (A, B, C), (F, G, H) and (K, L, M); the data of different segments are far apart in the spatial dimension, and the corresponding memory addresses are likewise far apart. Under the traditional approach, many addresses would have to be traversed from the first address before the required data could be read, costing a long data loading time. By preloading each datum into the cache of each computing unit in advance, the present application reduces the data loading time during computation.

The data preloading method of this embodiment splices and stores the original data and the zero-padding data in spatial order to form an input data set and distributes each datum of the input data set to the caches of different computing units. This improves data reuse, resolves the low data loading efficiency caused by the discontinuous addresses of the data the computing units fetch, reduces the number of memory reads, shortens data preparation time, lowers the latency between layer computations, and reduces the overall power consumption of the chip.
Embodiment 3
As shown in Figure 7, the data preloading apparatus for a convolutional neural network of Embodiment 3 includes a data acquisition module 100' and a data distribution module 200'. The data acquisition module 100' is configured to acquire an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network. The data distribution module 200' is configured to store the data of the input data set into the caches corresponding to the respective computing units according to a predetermined distribution scheme before the computing units perform convolution computation, so as to form a different target data set in each computing unit's cache, the data in a target data set being the data required by that computing unit during convolution. As shown in Figure 8, the original data are stored in memory, transferred to the data preloading apparatus, and distributed, together with the zero-padding data generated by the apparatus, to the caches of the different computing units.

Further, the data preloading apparatus also includes a configuration decoder 300' configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated; the memory address information includes the address length for memory access, and the spatial information includes the length of the zero-padding data.

Further, the data acquisition module 100' includes a memory controller 101' and a zero-padding generator 102'. The memory controller 101' reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102' generates the zero-padded data set according to the spatial information.

Further, the configuration decoder 300' is configured to receive a configuration file and to generate from it the preset time order and the target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation.

The data distribution module 200' includes a time encoding unit 201', a sequence acquisition unit 202' and a data distribution unit 203'. The time encoding unit 201' sets time numbers for each datum of the original data set and each datum of the zero-padded data set according to the preset time order; the sequence acquisition unit 202' acquires the target sequence of each computing unit; and the data distribution unit 203' stores the data in the original data set and the zero-padded data set whose time numbers match the target sequence into the cache corresponding to each computing unit. The data processing flow of the data distribution module 200' follows the description in Embodiment 1 and is not repeated here.
Embodiment 4
As shown in Figure 9, the data preloading apparatus for a convolutional neural network of Embodiment 4 includes a data acquisition module 100 and a data distribution module 200. The data acquisition module 100 is configured to acquire an original data set and a zero-padded data set, which together constitute the input data set of the convolutional neural network. The data distribution module 200 is configured to store the data of the input data set into the caches corresponding to the respective computing units according to a predetermined distribution scheme before the computing units perform convolution computation, so as to form a different target data set in each computing unit's cache, the data in a target data set being the data required by that computing unit during convolution.

Specifically, the data preloading apparatus also includes a configuration decoder 300 configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated; the memory address information includes the address length for memory access, and the spatial information includes the length of the zero-padding data.

The data acquisition module 100 includes a memory controller 101 and a zero-padding generator 102. The memory controller 101 reads data from memory according to the memory address information to form the original data set, and the zero-padding generator 102 generates the zero-padded data set according to the spatial information.

Further, the configuration decoder 300 is also configured to receive a configuration file and to generate from it the preset spatial order and the target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation.

The data distribution module 200 includes a data splicing unit 201, an address reading unit 202, a data distribution unit 203 and a data storage unit 204. The data splicing unit 201 splices each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set; the data storage unit 204 stores the input data set; the address reading unit 202 reads the target addresses of each computing unit; and the data distribution unit 203 stores the data in the input data set whose spatial addresses match the target addresses into the cache corresponding to each computing unit. The data processing flow of the data distribution module 200 follows the description in Embodiment 2 and is not repeated here.
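As a reading aid only, the division of labour among the units of Embodiment 4 can be rendered schematically in Python; the class and attribute names are hypothetical stand-ins for hardware blocks, and the configuration format is assumed rather than specified by this application:

```python
class ConfigDecoder:
    """Decodes the configuration file into memory address information,
    zero-padding information, and per-unit target addresses (element 300)."""
    def __init__(self, config):
        self.memory_addresses = config["memory_addresses"]  # where to read originals
        self.pad_addresses = config["pad_addresses"]        # where zeros are placed
        self.target_addresses = config["target_addresses"]  # per computing unit

class DataAcquisitionModule:
    """Memory controller (101) plus zero-padding generator (102)."""
    def __init__(self, memory, decoder):
        self.original = {a: memory[a] for a in decoder.memory_addresses}
        self.zeros = {a: 0 for a in decoder.pad_addresses}

class DataDistributionModule:
    """Splicing unit (201), storage unit (204), address reading unit (202),
    and distribution unit (203)."""
    def __init__(self, acquisition, decoder):
        self.input_set = {**acquisition.original, **acquisition.zeros}  # splice + store
        self.targets = decoder.target_addresses                         # address reading
    def distribute(self):
        # Store the data whose spatial addresses match each unit's targets.
        return {unit: {a: self.input_set[a] for a in addrs}
                for unit, addrs in self.targets.items()}

# Toy usage with an assumed configuration:
memory = {"a0": 7, "a1": 9}
cfg = {"memory_addresses": ["a0", "a1"], "pad_addresses": ["p0"],
       "target_addresses": {0: ["a0", "p0"], 1: ["a1", "p0"]}}
decoder = ConfigDecoder(cfg)
caches = DataDistributionModule(DataAcquisitionModule(memory, decoder), decoder).distribute()
```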
To verify the technical effect of the data preloading method of Embodiment 1, a simulation experiment was performed. The design was implemented in the Verilog HDL language, and the Modelsim simulation tool was used to verify its feasibility and running time. The experimental procedure was, in outline: prepare a configuration file for a specific neural network, write image data into memory, and then issue a start signal, after which the simulation runs automatically; when the experiment completes, the results recorded in memory are inspected through Modelsim.

The experiment proceeded as follows: 64 computing units were used, the input image size was 28*28 with 64 channels, the output image size was 28*28 with 128 channels, and the convolution kernel size was 5*5; the convolution computation started once the chip finished reading the configuration file. Experiment 1 distributed data using the data preloading algorithm of Embodiment 1: the total time to distribute the data from memory to the caches was 0.00504 ms, and the computation time of the computing units was 0.00403 ms. Experiment 2 used the traditional method of reading from memory while computing: the total time to read all the data from memory was 0.05814 ms, with the same computation time of 0.00403 ms. The results show that the data loading time with the preloading algorithm of Embodiment 1 is about an order of magnitude lower than that of the prior art (0.05814 / 0.00504 ≈ 11.5×), greatly improving overall efficiency.
The present application also discloses a computer-readable storage medium storing a data preloading program for a convolutional neural network; when executed by a processor, the program implements the data preloading method for a convolutional neural network of Embodiment 1 or Embodiment 2.
The present application also discloses a computer device. At the hardware level, as shown in Figure 10, the terminal includes a processor 12, an internal bus 13, a network interface 14 and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing apparatus at the logical level. Of course, besides software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow below is not limited to logic units and may also be hardware or logic devices. The computer-readable storage medium 11 stores a data preloading program for a convolutional neural network; when executed by the processor, the program implements the data preloading method for a convolutional neural network described above.

Computer-readable storage media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media, other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principles and spirit of the present invention as defined by the claims and their equivalents, and that such modifications and refinements also fall within the protection scope of the present invention.

Claims (18)

1. A data preloading method for a convolutional neural network, wherein the data preloading method comprises:
    acquiring an original data set and a zero-padded data set, the original data set and the zero-padded data set together constituting an input data set of the convolutional neural network;
    before each computing unit performs convolution computation, storing the data of the input data set into caches corresponding to the respective computing units according to a predetermined distribution scheme, so as to form different target data sets in the caches of the respective computing units, wherein the data in a target data set are the data required by the computing unit during the convolution computation.
2. The data preloading method for a convolutional neural network according to claim 1, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
    acquiring preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
3. The data preloading method for a convolutional neural network according to claim 2, wherein each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses.
4. The data preloading method for a convolutional neural network according to claim 3, wherein the target addresses of each computing unit comprise a plurality of address segments, wherein addresses within a same address segment are arranged consecutively and the address segments are spaced apart from one another.
5. The data preloading method for a convolutional neural network according to claim 1, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    setting a time number for each datum of the original data set and each datum of the zero-padded data set according to a preset time order;
    acquiring a preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
6. The data preloading method for a convolutional neural network according to claim 5, wherein each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences.
7. The data preloading method for a convolutional neural network according to claim 6, wherein the target sequence of each computing unit comprises a plurality of sequence segments, wherein time numbers within a same sequence segment are arranged consecutively and the sequence segments are spaced apart from one another.
8. A data preloading apparatus for a convolutional neural network, wherein the data preloading apparatus comprises:
    a data acquisition module configured to acquire an original data set and a zero-padded data set, wherein the original data set and the zero-padded data set together constitute an input data set of the convolutional neural network;
    a data distribution module configured to, before each computing unit performs convolution computation, store the data of the input data set into caches corresponding to the respective computing units according to a predetermined distribution scheme, so as to form different target data sets in the caches of the respective computing units, wherein the data in a target data set are the data required by the computing unit during the convolution computation.
9. The data preloading apparatus for a convolutional neural network according to claim 8, wherein the data preloading apparatus further comprises a configuration decoder configured to receive a configuration file and to generate, from the received configuration file, a preset spatial order and target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    the data distribution module comprising:
    a data splicing unit configured to splice each datum of the original data set and each datum of the zero-padded data set in the preset spatial order to form the input data set;
    a data storage unit configured to store the input data set;
    an address reading unit configured to read the target addresses of each computing unit;
    a data distribution unit configured to store the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
10. The data preloading apparatus for a convolutional neural network according to claim 8, wherein the data preloading apparatus further comprises a configuration decoder configured to receive a configuration file and to generate, from the received configuration file, a preset time order and a target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    the data distribution module comprising:
    a time encoding unit configured to set time numbers for each datum of the original data set and each datum of the zero-padded data set according to the preset time order;
    a sequence acquisition unit configured to acquire the target sequence of each computing unit;
    a data distribution unit configured to store the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
11. The data preloading apparatus for a convolutional neural network according to claim 9, wherein the configuration decoder is further configured to generate, from the configuration file, memory address information and spatial information of the zero-padding data to be generated;
    the data acquisition module comprising:
    a memory controller configured to read data from memory according to the memory address information to form the original data set;
    a zero-padding generator configured to generate the zero-padded data set according to the spatial information.
12. A computer-readable storage medium, wherein the computer-readable storage medium stores a data preloading program for a convolutional neural network, and the data preloading program for a convolutional neural network, when executed by a processor, implements the data preloading method for a convolutional neural network according to claim 1.
13. The computer-readable storage medium according to claim 12, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    splicing each datum of the original data set and each datum of the zero-padded data set in a preset spatial order to form the input data set;
    acquiring preset target addresses of each computing unit, the target addresses being the spatial addresses corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the input data set whose spatial addresses are the same as the target addresses into the cache corresponding to each computing unit.
14. The computer-readable storage medium according to claim 13, wherein each computing unit has a plurality of target addresses, and at least two computing units have partially identical target addresses.
15. The computer-readable storage medium according to claim 14, wherein the target addresses of each computing unit comprise a plurality of address segments, wherein addresses within a same address segment are arranged consecutively and the address segments are spaced apart from one another.
16. The computer-readable storage medium according to claim 12, wherein storing the data of the input data set into the caches corresponding to the respective computing units according to the predetermined distribution scheme specifically comprises:
    setting a time number for each datum of the original data set and each datum of the zero-padded data set according to a preset time order;
    acquiring a preset target sequence of each computing unit, the target sequence being the time numbers corresponding to the data required by each computing unit when performing convolution computation;
    storing the data in the original data set and the zero-padded data set whose time numbers are the same as the target sequence into the cache corresponding to each computing unit.
17. The computer-readable storage medium according to claim 16, wherein each computing unit has a plurality of target sequences, and at least two computing units have partially identical target sequences.
18. The computer-readable storage medium according to claim 17, wherein the target sequence of each computing unit comprises a plurality of sequence segments, wherein time numbers within a same sequence segment are arranged consecutively and the sequence segments are spaced apart from one another.
PCT/CN2020/106761 2020-07-29 2020-08-04 Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium WO2022021459A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010742731.9A CN114090470B (en) 2020-07-29 2020-07-29 Data preloading device and preloading method thereof, storage medium and computer equipment
CN202010742731.9 2020-07-29

Publications (1)

Publication Number Publication Date
WO2022021459A1 true WO2022021459A1 (en) 2022-02-03

Family

ID=80037382

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/106761 WO2022021459A1 (en) 2020-07-29 2020-08-04 Data pre-loading apparatus and data pre-loading method, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114090470B (en)
WO (1) WO2022021459A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341495A1 (en) * 2017-05-26 2018-11-29 Purdue Research Foundation Hardware Accelerator for Convolutional Neural Networks and Method of Operation Thereof
CN109165728A (en) * 2018-08-06 2019-01-08 济南浪潮高新科技投资发展有限公司 A kind of basic computational ele- ment and calculation method of convolutional neural networks
CN109359729A (en) * 2018-09-13 2019-02-19 深思考人工智能机器人科技(北京)有限公司 It is a kind of to realize data cached system and method on FPGA
CN109934339A (en) * 2019-03-06 2019-06-25 东南大学 A kind of general convolutional neural networks accelerator based on a dimension systolic array
CN110766150A (en) * 2019-10-15 2020-02-07 北京芯启科技有限公司 Regional parallel data loading device and method in deep convolutional neural network hardware accelerator

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102631381B1 (en) * 2016-11-07 2024-01-31 삼성전자주식회사 Convolutional neural network processing method and apparatus
CN106874219B (en) * 2016-12-23 2018-11-02 深圳云天励飞技术有限公司 A kind of data dispatching method of convolutional neural networks, system and computer equipment
CN107894957B (en) * 2017-11-14 2020-09-01 河南鼎视智能科技有限公司 Convolutional neural network-oriented memory data access and zero insertion method and device
WO2019119301A1 (en) * 2017-12-20 2019-06-27 华为技术有限公司 Method and device for determining feature image in convolutional neural network model
US20210201124A1 (en) * 2018-08-27 2021-07-01 Neuralmagic Inc. Systems and methods for neural network convolutional layer matrix multiplication using cache memory
CN109460813B (en) * 2018-09-10 2022-02-15 中国科学院深圳先进技术研究院 Acceleration method, device and equipment for convolutional neural network calculation and storage medium
CN110163338B (en) * 2019-01-31 2024-02-02 腾讯科技(深圳)有限公司 Chip operation method and device with operation array, terminal and chip

Also Published As

Publication number Publication date
CN114090470A (en) 2022-02-25
CN114090470B (en) 2023-02-17

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20947295; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20947295; Country of ref document: EP; Kind code of ref document: A1)