WO2022027818A1 - Data batch processing method and batch processing apparatus thereof, and storage medium - Google Patents

Data batch processing method and batch processing apparatus thereof, and storage medium Download PDF

Info

Publication number
WO2022027818A1
WO2022027818A1 (PCT/CN2020/120177)
Authority
WO
WIPO (PCT)
Prior art keywords
data
batch processing
strips
reorganized
channel data
Prior art date
Application number
PCT/CN2020/120177
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
雷明
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2022027818A1 publication Critical patent/WO2022027818A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781: On-chip cache; Off-chip memory
    • G06F15/7817: Specially adapted for signal processing, e.g. Harvard architectures
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention belongs to the technical field of data processing, and in particular, relates to a data batch processing method for neural networks, a batch processing device thereof, and a computer-readable storage medium.
  • With the promotion of big data and artificial intelligence technologies, deep learning algorithms based on artificial neural networks, relying on their powerful feature extraction capabilities, have achieved remarkable results in computer vision, natural language processing, and autonomous decision-making by agents.
  • however, neural network structures are becoming increasingly complex, accompanied by a sharp increase in parameter count and computation, which places higher demands on the data bandwidth and computing power of the hardware platform.
  • the technical problem solved by the present invention is how to reduce the number of times data is read from memory.
  • a data batch processing method for neural networks comprising:
  • the original channel data of the N consecutive frame images are spliced to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • the multiple reorganized data strips are sequentially input to the parallel computing unit array for convolution, where all the original channel data of the same reorganized data strip enter the computing unit at the same time.
  • each reorganized data strip further includes zero-padding data, and the data bit width of each reorganized data strip is equal to the memory bandwidth.
  • the data batch processing method further includes:
  • the multiple reorganized data strips are stored in the memory.
  • the method of sequentially inputting multiple reorganized data strips into the parallel computing unit array for convolution includes:
  • the memory bandwidth is 128 bits, N is 5, and the original channel data on each pixel position of each consecutive frame image includes red channel data, green channel data and blue channel data.
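As a hedged illustration of the parameters just stated (128-bit memory bandwidth, N = 5, and 8-bit R/G/B channels), the packing of one reorganized data strip can be sketched as follows. The function name and the integer representation of the strip are assumptions made for this sketch, not part of the disclosure.

```python
# Sketch: pack the RGB values of one pixel position from N = 5 consecutive
# frames into a single 128-bit reorganized data strip (illustrative only).

MEMORY_BANDWIDTH_BITS = 128
N_FRAMES = 5
CHANNEL_BITS = 8  # each of R, G, B occupies 8 bits

def pack_strip(pixels):
    """pixels: list of N (r, g, b) tuples, one per consecutive frame.
    Returns the strip as an int plus its bit width after zero-padding."""
    strip = 0
    bits_used = 0
    for r, g, b in pixels:
        for channel in (r, g, b):
            strip = (strip << CHANNEL_BITS) | (channel & 0xFF)
            bits_used += CHANNEL_BITS
    # zero-pad so the strip's bit width equals the memory bandwidth
    pad = MEMORY_BANDWIDTH_BITS - bits_used  # 128 - 120 = 8 bits
    strip <<= pad
    return strip, bits_used + pad

pixels = [(10, 20, 30)] * N_FRAMES  # same pixel position in 5 frames
strip, width = pack_strip(pixels)
print(width)  # 128: one strip fills the memory bandwidth exactly
```

The 5 frames contribute 5 × 24 = 120 data bits, and the 8 padding bits bring the strip to the full 128-bit bandwidth, so each memory access transfers one complete strip.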
  • the present application also discloses a data batch processing device for a neural network, the data batch processing device comprising:
  • a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth
  • a data reorganization module for splicing the original channel data of the N consecutive frame images to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • a convolution calculation module for reading multiple reorganized data strips and performing convolution on them in sequence, where all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  • the data batch processing device further includes a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
  • the convolution calculation module includes:
  • a multiplier-adder unit for multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively;
  • a storage unit for storing the multiply-add results of the consecutive frame images in each reorganized data strip.
  • the present invention also discloses a computer-readable storage medium storing a data batch processing program for neural networks; when the program is executed by a processor, the above data batch processing method for neural networks is implemented.
  • the invention discloses a data batch processing method for neural networks which, compared with traditional calculation methods, has the following technical effects:
  • the data structure optimized for three-dimensional arrays enables fast data buffering and avoids repeated weight reads between different frame images, thereby greatly reducing the number of off-chip memory accesses;
  • FIG. 1 is a flowchart of a data batch processing method for a neural network according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of the convolution calculation according to Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram of a data splicing process according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram of a data batch processing apparatus according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic diagram of a parallel computing unit array according to Embodiment 2 of the present invention.
  • FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
  • in the prior art, convolution is computed for each frame in turn, which requires repeatedly reading the weight data and reading the image data multiple times, wasting computing resources.
  • the present application exploits the fact that, in the convolution calculation, the weight data corresponding to the same pixel position is the same: the image channel data of adjacent frames at the same pixel position is reorganized, the reorganized data is input into the computing unit at the same time, and the convolution is performed with the same weight data, which reduces the number of reads of weight data and image data and greatly reduces computing energy consumption.
  • the data batch processing method for a neural network in the first embodiment includes the following steps:
  • Step S10 obtaining the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
  • Step S20 splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
  • Step S30 Inputting the multiple reorganized data strips into the parallel computing unit array in sequence for convolution, where all the original channel data of the same reorganized data strip enter the computing unit at the same time.
  • in step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches only the original channel data of one pixel, namely the red channel data R, green channel data G, and blue channel data B, each occupying 8 bits, for a total of 24 bits; thus only 24 bits of data are read per access, wasting memory bandwidth.
  • in the present embodiment, the original channel data of N consecutive frame images are selected and spliced, so that each read from memory fetches more channel data, improving the efficiency of memory bandwidth usage.
  • in step S20, the original channel data of the 5 consecutive frame images at the same pixel position are spliced to form reorganized data strips. For example, splicing the original channel data of the 5 consecutive frame images at the first pixel forms a reorganized data strip with a data bit width of 120 bits. As a preferred embodiment, the formed reorganized data strip is zero-padded so that its data bit width equals the memory bandwidth, yielding a 128-bit reorganized data strip.
  • a block memory from Xilinx, model 128-32, is used to read the original channel data of each image; however, the block memory can only read four color channel values (32 bits of data) at a time, while only 24 bits are actually needed, so the data read from the block memory must be further reorganized.
  • the R 0 G 0 B 0 read for the first time is spliced and zero-padded to form the reorganized data strip of the first pixel, namely R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, and a second register is set to store the reorganized data strip, completing the reorganization of the original channel data of the first pixel.
  • the B 2 read for the third time and the R 2 G 2 stored in the first register are spliced to form the reorganized data strip of the third pixel, namely R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0, and the R 3 G 3 B 3 of each image read for the third time is spliced to form the reorganized data strip of the fourth pixel, namely R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0, so that after three reads the reorganization of the original channel data of the four pixels is completed, forming four reorganized data strips.
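The three-read regrouping described above (32-bit block-memory words recombined into 24-bit R,G,B groups, with leftover bytes carried in the first register) can be sketched as follows. The byte-stream representation and helper name are assumptions for illustration only.

```python
# Sketch: recombine 32-bit (4-byte) block-memory reads into 24-bit (3-byte)
# RGB groups. Leftover bytes from each read are held in a carry variable,
# mirroring the role of the first register in the text. Illustrative only.

def regroup_reads(words):
    """words: list of 4-byte reads, e.g. [R0G0B0R1, G1B1R2G2, B2R3G3B3]."""
    carry = b""          # plays the role of the first register
    pixels = []
    for word in words:
        buf = carry + word
        while len(buf) >= 3:
            pixels.append(buf[:3])   # one complete R,G,B group
            buf = buf[3:]
        carry = buf                  # 0-2 leftover bytes for the next read
    return pixels

reads = [bytes([0, 0, 0, 1]),   # R0 G0 B0 R1
         bytes([1, 1, 2, 2]),   # G1 B1 R2 G2
         bytes([2, 3, 3, 3])]   # B2 R3 G3 B3
print(regroup_reads(reads))     # four 3-byte groups after three reads
```

Three 32-bit reads supply 12 bytes, exactly the four 24-bit pixel groups of the text, which is why four reorganized strips complete after three reads.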
  • a 32*64 parallel computing unit array is taken as an example, including 32*64 computing units, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple reorganized data strips and the weight data stored in each weight cache WB is shared by the 64 data caches.
  • each data cache stores the reorganized data strips of four adjacent pixels, that is, R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 0, R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0 and R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0.
  • writing the same reorganized data strip into the computing unit at the same time improves the efficiency of memory bandwidth utilization on the one hand and reduces the number of memory reads on the other.
  • a method of sequentially inputting multiple reorganized data strips into a parallel computing unit array for convolution includes:
  • Step S31 Multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively;
  • Step S32 Storing the multiply-add result of each consecutive frame image in each reorganized data strip into a different register.
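Steps S31 and S32 above can be sketched as follows: the same weight is applied to every frame's channel data within one strip, and each frame's result lands in its own register. All names and the toy values are assumptions for illustration.

```python
# Sketch of steps S31-S32: one reorganized strip holds the same pixel's
# channel data from N frames; each frame's data is multiplied by the SAME
# weight (read once), and each frame's result goes to its own register.

def convolve_strip(strip, weight):
    """strip: list of N (r, g, b) tuples; weight: (wr, wg, wb) shared
    across all N frames. Returns one partial sum per frame ("register")."""
    registers = []
    for r, g, b in strip:            # one entry per consecutive frame
        wr, wg, wb = weight          # same weight data for every frame
        registers.append(r * wr + g * wg + b * wb)
    return registers

strip = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 0, 1), (2, 2, 2)]
weight = (1, 10, 100)
print(convolve_strip(strip, weight))  # [321, 654, 987, 101, 222]
```

The weight tuple is fetched once per strip rather than once per frame, which is the source of the reduced weight-read count claimed in the text.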
  • each reorganized data strip includes the original channel data of the same pixel of the 5 consecutive frame images, and 5 third registers are set, respectively used to store the multiply-add results of the 5 consecutive frames.
  • each multiply-add result is stored in the corresponding third register, and so on, so that each calculation result is stored in a different third register.
  • in this way, the sharing of weight data is realized: there is no need to repeatedly read the weight data into the computing unit, repeated reading of image data is also avoided, and the number of memory accesses is reduced.
  • the apparatus for batch processing data for neural networks includes a data acquisition module 100, a data reorganization module 200 and a convolution calculation module 300.
  • the data acquisition module 100 is used for acquiring the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
  • the data reorganization module 200 is used for splicing the original channel data of the N consecutive frame images to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • the convolution calculation module 300 is used to read multiple reorganized data strips and perform convolution on them in sequence, where all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  • the data batch processing apparatus further includes a memory 400, and the memory 400 is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module 200.
  • the data acquisition module 100 includes a plurality of buffers, which are used to read and temporarily store the original channel data of the corresponding images from the memory 400 according to the memory bandwidth.
  • take the memory bandwidth equal to 128 bits and N equal to 5 as an example.
  • 5 different buffers are used to read and store the original channel data of the 5 consecutive frame images from the memory, with the color channel data arranged in sequence by pixel position, that is R 0 G 0 B 0 R 1 G 1 B 1 R 2 G 2 B 2 R 3 G 3 B 3 .
  • the data reorganization module 200 includes a block memory, a first register, a second register and a counter.
  • the block memory is a Xilinx Block Memory, model 128-32.
  • the block memory can only read four color channel values (32 bits of data) at a time, while only 24 bits are actually needed, so the data read from the block memory must be further reorganized.
  • in the first read, the block memory reads from each buffer the data R 0 G 0 B 0 R 1 ; at this time, R 1 is stored in the first register.
  • the R 0 G 0 B 0 read for the first time is spliced and zero-padded to form the reorganized data strip of the first pixel, namely R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, which is stored in the second register, while the value of the counter is set to 0.
  • the data read from the respective buffers by the block memory is G 1 B 1 R 2 G 2 .
  • R 2 G 2 is stored in the first register, and the R 1 pre-stored in the first register is spliced with the G 1 B 1 read for the second time and zero-padded to form the reorganized data strip of the second pixel, namely R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 0, which is stored in the second register, while the value of the counter is set to 1.
  • the data read from the respective buffers by the block memory is B 2 R 3 G 3 B 3 .
  • the R 2 G 2 pre-stored in the first register is spliced with the B 2 read for the third time and zero-padded to form the reorganized data strip of the third pixel, namely R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0, which is stored in the second register, while the value of the counter is set to 2.
  • the R 3 G 3 B 3 read for the third time is spliced and zero-padded to form the reorganized data strip of the fourth pixel, namely R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0, which is stored in the second register, while the value of the counter is set to 3.
  • thus, after three reads, the reorganization of the original channel data of the four pixels is completed, forming four reorganized data strips.
  • the above steps are repeated until the reorganization of the original channel data of all pixels of the 5 consecutive frame images is completed, and all the reorganized data strips obtained are stored in the memory for use in subsequent calculations.
  • each data cache TB stores multiple reorganized data strips, the weight data stored in each weight cache WB is shared by the 64 data caches, and the convolution calculation module is the computing unit PE.
  • the convolution calculation module includes a multiplier-adder unit and a storage unit; the multiplier-adder unit is used for multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively,
  • and the storage unit is used to store the multiply-add result of each consecutive frame image in each reorganized data strip.
  • the multiplier-adder unit includes a multiplier and an adder
  • the storage unit includes a data selector 301 , a data distributor 302 and five third registers 303 .
  • the multiplier calculates W 00 *R 0 , and the data selector 301 reads the current data from the corresponding third register 303 ;
  • after the adder's calculation, since the initial value of the third register is zero, the calculation result of the adder is W 00 *R 0 , and the calculation result W 00 *R 0 is then stored in the third register 303 through the data distributor 302 .
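A hedged sketch of the selector/distributor accumulation just described: the data selector reads the current partial sum from a frame's third register, the adder accumulates the new product, and the data distributor writes the result back. The class and method names are assumptions, not the disclosed hardware.

```python
# Sketch: multiply-accumulate into per-frame "third registers". The data
# selector reads the current value, the adder adds weight*channel, and the
# data distributor writes the result back (illustrative only).

class ThirdRegisters:
    def __init__(self, n=5):
        self.regs = [0] * n          # initial value of each register is zero

    def mac(self, frame_idx, weight, channel):
        current = self.regs[frame_idx]          # data selector: read
        result = current + weight * channel     # multiplier + adder
        self.regs[frame_idx] = result           # data distributor: write back
        return result

regs = ThirdRegisters()
# First MAC for frame 0: the register starts at zero, so the result is W00*R0.
print(regs.mac(0, 3, 7))   # 21
print(regs.mac(0, 2, 5))   # 21 + 10 = 31
```

Because each frame owns one register, the five partial sums of a strip accumulate independently while sharing a single weight read.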
  • the present application also discloses a computer-readable storage medium storing a data batch processing program for neural networks; when the program is executed by a processor, the above data batch processing method for neural networks is implemented.
  • the present application also discloses a computer device.
  • the computer device includes a processor 12 , an internal bus 13 , a network interface 14 , and a computer-readable storage medium 11 .
  • the processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing device on a logical level.
  • the computer-readable storage medium 11 stores a data batch processing program for a neural network; when the program is executed by a processor, the above data batch processing method for a neural network is implemented.
  • computer-readable storage media include both persistent and non-persistent, removable and non-removable media; information storage can be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices.

Abstract

A data batch processing method and a batch processing apparatus thereof, and a storage medium. The data batch processing method comprises: obtaining memory bandwidth and selecting, according to the memory bandwidth, raw channel data of N continuous image frames (S10); joining the raw channel data of the N continuous image frames to form multiple reorganized data strips, each reorganized data strip comprising the raw channel data of the N continuous image frames at the same pixel position (S20); and sequentially inputting the multiple reorganized data strips into a parallel computing unit array for convolution operation, all of the raw channel data of the same reorganized data strip entering a computing unit at the same time (S30). Image channel data of multiple adjacent frames at the same pixel position is reorganized, the reorganized data is input into a computing unit at the same time, and the convolution operation is performed with the same weight data. As such, the number of times weight data and image data are read can be reduced, and computational energy consumption can be greatly reduced.

Description

数据批处理方法及其批处理装置、存储介质Data batch processing method, batch processing device and storage medium 技术领域technical field
本发明属于数据处理技术领域,具体地讲,涉及用于神经网络的数据批处理方法及其批处理装置、计算机可读存储介质。The present invention belongs to the technical field of data processing, and in particular, relates to a data batch processing method for neural networks, a batch processing device thereof, and a computer-readable storage medium.
背景技术Background technique
随着大数据与人工智能技术的推广,基于人工神经网络的深度学习算法依靠其强大的特征提取能力,在计算机视觉、自然语言处理以及智能体自主决策等领域取得了显著的成果。但神经网络结构日趋复杂,伴随而来的是参数量和计算量的急剧增加,这对硬件平台的数据带宽和计算能力有较高的要求。With the promotion of big data and artificial intelligence technologies, deep learning algorithms based on artificial neural networks have achieved remarkable results in the fields of computer vision, natural language processing, and autonomous decision-making by agents, relying on their powerful feature extraction capabilities. However, the structure of neural network is becoming more and more complex, which is accompanied by a sharp increase in the amount of parameters and calculation, which has higher requirements on the data bandwidth and computing power of the hardware platform.
其中,连续图像处理技术,例如视频流中的目标识别、跟踪、超分辨重建等,在智能应用中占据举足轻重的地位。现今主流深度学习加速器对单帧图像的智能处理具备很好的提速效果,然而对于视频应用,单帧加速技术的直接运用会造成极大的计算资源浪费,特别是造成大量重复的片下存储器读写操作。其核心原因在于不同帧图像间的重复权重读取,离散数据读取等未经优化的内存操作。Among them, continuous image processing technologies, such as target recognition, tracking, and super-resolution reconstruction in video streams, play an important role in intelligent applications. Today's mainstream deep learning accelerators have a very good speed-up effect on the intelligent processing of single-frame images. However, for video applications, the direct application of single-frame acceleration technology will cause a huge waste of computing resources, especially a large number of repeated off-chip memory reads. write operation. The core reason is unoptimized memory operations such as repeated weight reading between different frame images and discrete data reading.
发明内容SUMMARY OF THE INVENTION
(一)本发明所要解决的技术问题(1) Technical problem to be solved by the present invention
本发明解决的技术问题是:如何减少从内存读取数据的次数。The technical problem solved by the present invention is: how to reduce the number of times of reading data from the memory.
(二)本发明所采用的技术方案(2) Technical scheme adopted in the present invention
一种用于神经网络的数据批处理方法,所述数据批处理方法包括:A data batch processing method for neural networks, the data batch processing method comprising:
获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Obtain memory bandwidth and select the original channel data of N continuous frame images according to the memory bandwidth;
将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道 数据;The original channel data of described N continuous frame images is spliced, and multiple reorganization data strips are formed, wherein each part of the reorganized data strip includes the original channel data of described N continuous frame images on the same pixel position;
将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。A plurality of reconstituted data strips are sequentially input to the parallel computing unit array for convolution operation, wherein all the original channel data of the same reconstituted data strip enter the computing unit at the same time.
可选择地,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。Optionally, each piece of the restructured data strip further includes zero-padding data, and the data bit width of each piece of the restructured data strip is equal to the memory bandwidth.
可选择地,所述数据批处理方法还包括:Optionally, the data batch processing method further includes:
将所述多份重组数据条存储至内存中。The plurality of reconstituted data strips are stored in the memory.
可选择地,将多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:Optionally, the method for sequentially inputting multiple pieces of recombined data strips into the parallel computing unit array for convolution operation includes:
将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Multiply and add the original channel data of each continuous frame image in each of the reorganized data strips with the same weight data respectively;
将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。The results of the multiplication and addition operations of the successive frame images in each of the recombined data strips are stored in different registers.
可选择地,所述内存带宽为128比特,N为5,每张所述连续帧图像的每个像素位置上的原始通道数据包括红色通道数据、绿色通道数据和蓝色通道数据。Optionally, the memory bandwidth is 128 bits, N is 5, and the original channel data on each pixel position of each consecutive frame image includes red channel data, green channel data and blue channel data.
本申请还公开一种用于神经网络的数据批处理装置,所述数据批处理装置包括:The present application also discloses a data batch processing device for a neural network, the data batch processing device comprising:
数据获取模块,用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
数据重组模块,用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;A data reorganization module for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channels of the N continuous frame images at the same pixel position data;
卷积计算模块,用于读取多份重组数据条并依序对多份重组数据条进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。The convolution calculation module is used to read multiple reorganized data strips and perform convolution operation on the multiple reorganized data strips in sequence, wherein all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time Pick.
可选择地,所述数据批处理装置还包括内存,所述内存用于接收并存储所述数据重组模块形成的多份重组数据条。Optionally, the data batch processing device further includes a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
可选择地,所述卷积计算模块包括:Optionally, the convolution calculation module includes:
乘加器单元,用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;A multiplier-adder unit, used for multiplying and adding the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
存储单元,用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The storage unit is used for storing the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip.
本发明还公开了一种计算机可读存储介质,所述计算机可读存储介质存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present invention also discloses a computer-readable storage medium, where the computer-readable storage medium stores a data batch processing program for neural networks, and when the data batch processing program for neural networks is executed by a processor, the above-mentioned Data batching methods for neural networks.
(三)有益效果(3) Beneficial effects
本发明公开了一种用于神经网络的数据批处理方法,相对于传统的计算方法,具有如下技术效果:The invention discloses a data batch processing method for neural network, which has the following technical effects compared with the traditional calculation method:
(1)面向三维阵列的优化数据结构,能够实现数据快速缓冲,避免不同帧图像间重复权重的读取,从而大幅度降低了片下存储器访问次数;(1) The optimized data structure for three-dimensional arrays can realize fast data buffering and avoid repeated weight reading between different frame images, thereby greatly reducing the number of off-chip memory accesses;
(2)本方法思路新颖，从输入数据本身特性出发：对于首层卷积层，在输入数据（如背景图像及监控录像等）图像单一不变的情况下可以发挥很大效果，对深度神经网络中卷积核深度较大的情况也具有很大潜力。(2) This method takes a novel approach that starts from the characteristics of the input data itself: for the first convolutional layer, it is highly effective when the input images are largely static (e.g., background images or surveillance footage), and it also shows great potential when the convolution kernel depth in a deep neural network is large.
附图说明Description of drawings
图1为本发明的实施例一的用于神经网络的数据批处理方法的流程图;FIG. 1 is a flowchart of a data batch processing method for a neural network according to Embodiment 1 of the present invention;
图2为本发明的实施例一的卷积计算的流程图;Fig. 2 is the flow chart of the convolution calculation of Embodiment 1 of the present invention;
图3为本发明的实施例一的数据拼接过程示意图;3 is a schematic diagram of a data splicing process according to Embodiment 1 of the present invention;
图4为本发明的实施例二的数据批处理装置的示意图;4 is a schematic diagram of a data batch processing apparatus according to Embodiment 2 of the present invention;
图5为本发明的实施例二的并行计算单元阵列的示意图;5 is a schematic diagram of a parallel computing unit array according to Embodiment 2 of the present invention;
图6为本发明的实施例的计算机设备示意图。FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
具体实施方式detailed description
为了使本发明的目的、技术方案及优点更加清楚明白,以下结合附图及实 施例,对本发明进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.
在详细描述本申请的各个实施例之前，首先简单描述本申请的发明构思：现有技术中，依次对每一帧图片进行卷积计算，需要重复读取权重数据以及多次读取图像数据，会造成计算资源浪费。本申请利用卷积计算过程中相同像素位置对应的权重数据相同的特点，将相邻多帧图片在相同像素位置上的图像通道数据进行重组，并将重组数据在同一时刻输入到计算单元中，与相同的权重数据进行卷积运算，这样可以减少权重数据和图像数据的读取次数，大大降低计算能耗。Before describing the embodiments of the present application in detail, the inventive concept is briefly summarized. In the prior art, the convolution is computed on each frame in turn, which requires repeatedly reading the weight data and reading the image data multiple times, wasting computing resources. The present application exploits the fact that, during convolution, the same pixel position corresponds to the same weight data: the image channel data of multiple adjacent frames at the same pixel position are reorganized, and the reorganized data are input into the computing unit at the same time and convolved with the same weight data. This reduces the number of reads of both the weight data and the image data, greatly lowering computing energy consumption.
实施例一Example 1
具体地,如图1所示,本实施例一的用于神经网络的数据批处理方法包括如下步骤:Specifically, as shown in FIG. 1 , the data batch processing method for a neural network in the first embodiment includes the following steps:
步骤S10:获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Step S10: obtaining the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
步骤S20:将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;Step S20: splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
步骤S30:将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。Step S30: Inputting multiple pieces of recombined data strips into the parallel computing unit array in sequence for convolution operation, wherein all the original channel data of the same piece of recombined data strips enter the computing unit at the same time.
在步骤S10中，以内存带宽等于128比特为例，在现有技术中，从内存读取数据时，每次只读到一个像素点的原始通道数据，包括红色通道数据R、绿色通道数据G和蓝色通道数据B，每个颜色通道数据占8比特，总共24比特，这样每次只读取了24比特数据，浪费了内存带宽。本实施例根据卷积计算过程的特点，结合图像数据的特性，根据实际使用的内存带宽的大小，选择N张连续帧图像的原始通道数据进行拼接，使得每次从内存读取数据时可读取到更多的通道数据，提高内存带宽的使用效率。In step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches only the original channel data of a single pixel, namely red channel data R, green channel data G and blue channel data B. Each color channel occupies 8 bits, 24 bits in total, so only 24 bits are read per access, wasting memory bandwidth. Based on the characteristics of the convolution calculation and of the image data, this embodiment selects the original channel data of N consecutive frame images according to the memory bandwidth actually in use and splices them, so that more channel data can be read per memory access, improving the utilization efficiency of the memory bandwidth.
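To make the bandwidth arithmetic above concrete, the following is a small illustrative sketch (not part of the patent; the variable names are ours) showing why N = 5 frames fit a 128-bit bus and how the bus utilization improves compared with reading one pixel at a time:

```python
# Hypothetical illustration of the N = 5 example in this embodiment:
# how many pixels' worth of 24-bit RGB data fit into one 128-bit
# memory transaction, and the resulting bus utilization.

MEMORY_BANDWIDTH_BITS = 128
BITS_PER_PIXEL = 3 * 8  # R, G, B channels, 8 bits each

# Reading one pixel per transaction (the prior-art behaviour):
single_pixel_utilization = BITS_PER_PIXEL / MEMORY_BANDWIDTH_BITS  # 24/128

# Packing N frames' data for the same pixel position into one strip:
n_frames = MEMORY_BANDWIDTH_BITS // BITS_PER_PIXEL  # floor(128 / 24) = 5
packed_bits = n_frames * BITS_PER_PIXEL             # 120 bits
padding_bits = MEMORY_BANDWIDTH_BITS - packed_bits  # 8 zero bits appended
packed_utilization = packed_bits / MEMORY_BANDWIDTH_BITS  # 120/128

print(n_frames, packed_bits, padding_bits)           # 5 120 8
print(single_pixel_utilization, packed_utilization)  # 0.1875 0.9375
```

Utilization thus rises from 24/128 to 120/128 of the bus width per access under these assumptions.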
示例性地，以内存带宽等于128比特、N等于5为例，对步骤S20中的拼接过程进行详细描述。将5张连续帧图像在同一像素位置上的原始通道数据进行拼接，形成重组数据条，例如将5张连续帧图像在第一像素点的原始通道数据进行拼接，形成数据位宽为120比特的重组数据条。作为优选实施例，对形成的重组数据条进行补零处理，使得重组数据条的数据位宽等于内存带宽，例如在数据位宽为120比特的重组数据条的末尾补上8个0，形成128比特的重组数据条。Exemplarily, taking a memory bandwidth of 128 bits and N equal to 5 as an example, the splicing process in step S20 is described in detail. The original channel data of the 5 consecutive frame images at the same pixel position are spliced to form a reorganized data strip; for example, splicing their original channel data at the first pixel forms a reorganized data strip with a data bit width of 120 bits. As a preferred embodiment, the formed reorganized data strip is zero-padded so that its data bit width equals the memory bandwidth; for example, 8 zeros are appended to the end of the 120-bit strip to form a 128-bit reorganized data strip.
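A minimal sketch of the splicing-and-padding step described above, assuming 8-bit channel bytes; `build_strip` is a hypothetical helper of ours, not a function named in the patent:

```python
# Splice the RGB bytes of N = 5 frames at one pixel position into a
# single 120-bit strip, then zero-pad it to the 128-bit memory bandwidth.

def build_strip(rgb_per_frame, bandwidth_bits=128):
    """rgb_per_frame: list of (R, G, B) byte tuples, one per frame."""
    payload = bytearray()
    for r, g, b in rgb_per_frame:
        payload += bytes([r, g, b])        # 24 bits per frame
    pad_bits = bandwidth_bits - 8 * len(payload)
    assert pad_bits >= 0, "too many frames for this bandwidth"
    payload += bytes(pad_bits // 8)        # zero padding (8 bits for N = 5)
    return bytes(payload)

strip = build_strip([(10, 20, 30)] * 5)   # same pixel in 5 frames
print(len(strip) * 8)                     # 128
```

For N = 5 the payload is 15 bytes (120 bits) and one zero byte brings the strip to the full 128-bit bus width.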
As a preferred embodiment, a Xilinx 128-32 Block Memory is used to read the original channel data of each image. This block memory can only read four channel values (32 bits) per access, while each pixel actually needs only 24 bits, so the data must be further reorganized after each read. Exemplarily, the original channel data of the 5 consecutive frame images are temporarily stored in 5 buffers. As shown in Fig. 3, the channel data are arranged in pixel order, i.e., R0G0B0R1G1B1R2G2B2R3G3B3.... When reading with the block memory, the first buffer yields, in sequence, R0G0B0R1, G1B1R2G2 and B2R3G3B3. A first register is provided: the R1 obtained in the first read is stored in the first register for use in the next splicing, while the R0G0B0 of the 5 images are spliced and zero-padded to form the reorganized data strip of the first pixel, i.e., R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, and a second register is provided to store the reorganized data strip, completing the reorganization of the original channel data of the first pixel. Similarly, when reorganizing the original channel data of the second pixel, the G1B1 obtained in the second read is combined with the R1 held in the first register to form the reorganized data strip of the second pixel, i.e., R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, and the R2G2 obtained in the second read is stored in the first register for the next reorganization. Similarly, when reorganizing the original channel data of the third pixel, the B2 obtained in the third read is spliced with the R2G2 held in the first register to form the reorganized data strip of the third pixel, i.e., R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0, and the R3G3B3 of each image obtained in the third read are spliced to form the reorganized data strip of the fourth pixel, i.e., R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0. In this way, every three reads complete the reorganization of the original channel data of four pixels, forming four reorganized data strips.
重复上述步骤,直至完成5张连续帧图像的全部像素点的原始通道数据的重组,并将得到的全部重组数据条存储至内存中,以便于后续计算使用。The above steps are repeated until the recombination of the original channel data of all the pixels of the 5 consecutive frame images is completed, and all the recombined data strips obtained are stored in the memory for use in subsequent calculations.
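The read-and-carry procedure above (four reorganized strips completed per three 32-bit reads) can be simulated as follows. The function and variable names are our own illustration of the first-register/second-register mechanism, not the patent's hardware implementation:

```python
# Simulation sketch: each of the 5 frame buffers is read 4 channel values
# (32 bits) at a time; a carry list per buffer plays the role of the
# "first register", holding channel values that do not yet complete a
# pixel, so every 3 reads yield the strips for 4 pixel positions.

def reorganize(frames, reads=3, values_per_read=4):
    carry = [[] for _ in frames]       # leftover values per frame buffer
    strips = []                        # completed per-pixel strips
    pos = [0] * len(frames)
    for _ in range(reads):
        pending = []
        for i, frame in enumerate(frames):
            chunk = list(frame[pos[i]:pos[i] + values_per_read])
            pos[i] += values_per_read
            pending.append(carry[i] + chunk)
        # every complete (R, G, B) triple across all frames forms one strip
        while all(len(p) >= 3 for p in pending):
            strip = []
            for p in pending:
                strip += p[:3]
                del p[:3]
            strips.append(strip + [0])  # zero padding to the bus width
        carry = pending
    return strips

# 5 identical frames, channels laid out R0 G0 B0 R1 G1 B1 ...
frame = [f"{c}{k}" for k in range(4) for c in "RGB"]
out = reorganize([frame] * 5)
print(len(out))        # 4 strips after 3 reads
print(out[0][:3])      # ['R0', 'G0', 'B0']
```

Each emitted strip holds 5 × 3 channel values plus one padding element, matching the 120-bit payload plus 8 padding bits of the 128-bit strip.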
在步骤S30中，以32*64的并行计算单元阵列为例，包括32*64个计算单元，64个数据缓存TB和32个权重缓存WB，其中，每个数据缓存TB中存储有多份重组数据条，每个权重缓存WB存储的权重数据由64个数据缓存共享。In step S30, taking a 32*64 parallel computing unit array as an example, it includes 32*64 computing units, 64 data buffers TB and 32 weight buffers WB, wherein each data buffer TB stores multiple reorganized data strips, and the weight data stored in each weight buffer WB is shared by the 64 data buffers.
As a preferred embodiment, taking a sliding window of size 2*2 in the convolution calculation as an example, each data buffer stores the reorganized data strips of four adjacent pixels, namely R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0 and R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0. During convolution, the whole of a reorganized data strip is written into the computing unit at the same time, which on the one hand improves memory bandwidth utilization and on the other hand reduces the number of memory reads.
具体地,如图2所示,多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:Specifically, as shown in FIG. 2 , a method for sequentially inputting multiple pieces of recombined data strips into a parallel computing unit array for convolution operation includes:
步骤S31:将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Step S31: Multiply and add the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
步骤S32:将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。Step S32: Store the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip into different registers.
Exemplarily, taking 5 consecutive frame images as an example, each reorganized data strip includes the original channel data of 5 pixels, one from each of the 5 consecutive frames at the same pixel position, and 5 third registers are provided to store the multiply-add results of the 5 frames respectively. As shown in Fig. 5, for the original channel data of the first pixel of the first frame image, the multiply-add result is F0 = W00*R0 + W01*G0 + W02*B0; this result is stored in the corresponding third register until the multiply-add results of all pixels within the sliding window of the first frame have been obtained, after which the individual results are summed. Similarly, the multiply-add result for the first pixel of the second frame image is F1 = W00*R0 + W01*G0 + W02*B0 (where R0, G0 and B0 now denote the second frame's channel values at that pixel), and this result is stored in its own third register; by analogy, each result is stored in a different third register. During convolution, since the same reorganized data strip corresponds to the same weight data, the weight data can be shared and need not be read repeatedly; and since all the original channel data of the same strip are read into the computing unit at once, repeated reads of the image data are also avoided, reducing the number of memory accesses.
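As a sketch of the shared-weight computation just described (our own formulation, not the patent's circuit): the weights W00, W01 and W02 are fetched a single time and applied to every frame's channel data in the strip, producing one partial sum per third register:

```python
# One strip carries the (R, G, B) values of N frames at the same pixel;
# all N partial sums F_k = W00*R + W01*G + W02*B are formed against the
# SAME weights, so the weights are fetched once instead of once per frame.

def mac_strip(strip, weights):
    """strip: [(R, G, B), ...], one tuple per frame; weights: (W00, W01, W02)."""
    w00, w01, w02 = weights          # read the weight data a single time
    registers = []                   # one "third register" per frame
    for r, g, b in strip:
        registers.append(w00 * r + w01 * g + w02 * b)
    return registers

partials = mac_strip([(1, 2, 3)] * 5, weights=(2, 3, 4))
print(partials)  # [20, 20, 20, 20, 20]
```

With identical frames every register holds the same partial sum; with real video the frames differ but still share one weight fetch per strip.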
实施例二Embodiment 2
如图4所示,本实施例二的用于神经网络的数据批处理装置包括数据获取模块100、数据重组模块200和卷积计算模块300,数据获取模块100用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;数据重组模块200用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;卷积计算模块300用于读取多份重组数据条并依序对多份重组 数据条进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。其中数据批处理装置还包括内存400,所述内存400用于接收并存储所述数据重组模块200形成的多份重组数据条。As shown in FIG. 4 , the apparatus for batch processing data for neural networks according to the second embodiment includes a data acquisition module 100, a data reorganization module 200 and a convolution calculation module 300. The data acquisition module 100 is used for acquiring memory bandwidth and according to the The memory bandwidth selects the original channel data of N continuous frame images; the data reorganization module 200 is used for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the described The original channel data of N consecutive frame images at the same pixel position; the convolution calculation module 300 is used to read multiple pieces of reconstructed data strips and perform convolution operation on the multiple pieces of reconstructed data strips in sequence, wherein the same piece of reconstructed data strips All raw channel data are read by the convolution calculation module at the same time. The data batch processing apparatus further includes a memory 400, and the memory 400 is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module 200.
具体来说,数据获取模块100包括多个缓存器,多个缓存器用于根据内存带宽的数据从内存模块400中读取并暂存相应图像的原始通道数据。以内存带宽等于128比特,N等于5为例,采用5个不同的缓存器从内存中读取并存储5张连续帧图像的原始通道数据,其中按照像素位置顺序,依次排列颜色通道数据,即R 0G 0B 0R 1G 1B 1R 2G 2B 2R 3G 3B 3Specifically, the data acquisition module 100 includes a plurality of buffers, and the plurality of buffers are used to read and temporarily store the original channel data of the corresponding image from the memory module 400 according to the data of the memory bandwidth. Taking the memory bandwidth equal to 128 bits and N equal to 5 as an example, 5 different buffers are used to read and store the original channel data of 5 consecutive frame images from the memory, and the color channel data are arranged in sequence according to the pixel position, that is R 0 G 0 B 0 R 1 G 1 B 1 R 2 G 2 B 2 R 3 G 3 B 3 .
The data reorganization module 200 includes a block memory, a first register, a second register and a counter; exemplarily, the block memory is a Xilinx 128-32 Block Memory. This block memory can only read four channel values (32 bits) per access, while only 24 bits are actually needed, so the data must be further reorganized after each read. On the first read, the block memory obtains R0G0B0R1 from each buffer; R1 is stored in the first register, and the R0G0B0 of the images are spliced and zero-padded to form the reorganized data strip of the first pixel, i.e., R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, which is stored in the second register while the counter is set to 0. Similarly, on the second read, the block memory obtains G1B1R2G2 from each buffer; R2G2 is stored in the first register, and the R1 previously held in the first register is spliced with the newly read G1B1 and zero-padded to form the reorganized data strip of the second pixel, i.e., R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, which is stored in the second register while the counter is set to 1. On the third read, the block memory obtains B2R3G3B3 from each buffer; the R2G2 previously held in the first register is spliced with the newly read B2 and zero-padded to form the reorganized data strip of the third pixel, i.e., R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0, which is stored in the second register while the counter is set to 2. The newly read R3G3B3 are then spliced and zero-padded to form the reorganized data strip of the fourth pixel, i.e., R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0, which is stored in the second register while the counter is set to 3. In this way, every three reads complete the reorganization of the original channel data of four pixels, forming four reorganized data strips. The above steps are repeated until the original channel data of all pixels of the 5 consecutive frame images have been reorganized, and all the resulting reorganized data strips are stored in the memory for use in subsequent calculations.
进一步地，如图5所示，以32*64的并行计算单元阵列为例，包括32*64个计算单元PE，64个数据缓存TB和32个权重缓存WB，其中，每个数据缓存TB中存储有多份重组数据条，每个权重缓存WB存储的权重数据由64个数据缓存共享，卷积计算模块即为计算单元PE。Further, as shown in Fig. 5, taking a 32*64 parallel computing unit array as an example, it includes 32*64 computing units PE, 64 data buffers TB and 32 weight buffers WB, wherein each data buffer TB stores multiple reorganized data strips, the weight data stored in each weight buffer WB is shared by the 64 data buffers, and the convolution calculation module is the computing unit PE.
其中，卷积计算模块包括乘加器单元和存储单元，乘加器单元用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算，存储单元用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The convolution calculation module includes a multiplier-adder unit and a storage unit. The multiplier-adder unit is configured to multiply-add the original channel data of each continuous frame image in each reorganized data strip with the same weight data, and the storage unit is configured to store the multiply-add result of each continuous frame image in each reorganized data strip.
Exemplarily, the multiplier-adder unit includes a multiplier and an adder, and the storage unit includes a data selector 301, a data distributor 302 and 5 third registers 303. For example, for the original channel data of the first pixel of the first frame image, the multiplier computes W00*R0, the data selector 301 reads the data from the corresponding third register 303, and the adder performs the addition; since the initial value of the third register is zero, the adder's result is W00*R0, which the data distributor 302 then stores into the third register 303. The multiplier next computes W01*G0, the data selector 301 reads W00*R0 from the corresponding third register 303, and the adder produces W00*R0 + W01*G0, which the data distributor 302 stores into the third register 303. Finally, the multiplier computes W02*B0, the data selector 301 reads W00*R0 + W01*G0 from the corresponding third register 303, and the adder produces F0 = W00*R0 + W01*G0 + W02*B0, which the data distributor 302 stores into the third register 303. By analogy, the convolution calculation of each piece of original channel data is completed. Since the same reorganized data strip corresponds to the same weight data, W00, W01 and W02, for example, must each be reused five times; an additional address pointer and counter can therefore be provided, and as long as a group of weight data has been used fewer than five times, the address pointer is controlled so that the weight data are reused.
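The selector/adder/distributor datapath and the weight-reuse counter described above can be sketched as follows. The class and method names are our own illustration, and the control logic is simplified relative to a real PE:

```python
# Sketch of one PE: per-frame "third registers" are read (data selector),
# updated by a multiply-add (multiplier + adder), and written back (data
# distributor). A reuse counter tracks that each weight group serves all
# 5 frames in the strip before the address pointer may advance.

class ProcessingElement:
    def __init__(self, n_frames=5):
        self.n_frames = n_frames
        self.third_regs = [0] * n_frames   # per-frame accumulators

    def mac_step(self, frame_idx, weight, value):
        acc = self.third_regs[frame_idx]          # data selector read
        acc += weight * value                     # multiplier + adder
        self.third_regs[frame_idx] = acc          # data distributor write

def run_strip(pe, strip, weights):
    reuse_count = 0
    for frame_idx, channels in enumerate(strip):
        for w, x in zip(weights, channels):       # W00*R + W01*G + W02*B
            pe.mac_step(frame_idx, w, x)
        reuse_count += 1                          # weight group reused once
    assert reuse_count == pe.n_frames             # used exactly 5 times
    return pe.third_regs

pe = ProcessingElement()
print(run_strip(pe, [(1, 1, 1)] * 5, weights=(2, 3, 4)))  # [9, 9, 9, 9, 9]
```

Only after the reuse counter reaches 5 would the address pointer be allowed to fetch the next weight group, mirroring the control described in the text.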
本申请还公开了一种计算机可读存储介质，所述计算机可读存储介质存储有用于神经网络的数据批处理程序，所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present application also discloses a computer-readable storage medium storing a data batch processing program for a neural network, and when the data batch processing program is executed by a processor, it implements the above data batch processing method for a neural network.
本申请还公开了一种计算机设备,在硬件层面,如图6所示,该终端包括处理器12、内部总线13、网络接口14、计算机可读存储介质11。处理器12从计算机可读存储介质中读取对应的计算机程序然后运行,在逻辑层面上形成请求处理装置。当然,除了软件实现方式之外,本说明书一个或多个实施例并 不排除其他实现方式,比如逻辑器件抑或软硬件结合的方式等等,也就是说以下处理流程的执行主体并不限定于各个逻辑单元,也可以是硬件或逻辑器件。所述计算机可读存储介质11上存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present application also discloses a computer device. At the hardware level, as shown in FIG. 6 , the terminal includes a processor 12 , an internal bus 13 , a network interface 14 , and a computer-readable storage medium 11 . The processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing device on a logical level. Of course, in addition to software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware, etc., that is to say, the execution subjects of the following processing procedures are not limited to each Logic unit, which can also be hardware or logic device. The computer-readable storage medium 11 stores a data batch program for a neural network, and when the data batch program for a neural network is executed by a processor, implements the above-mentioned data batch method for a neural network.
计算机可读存储介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机可读存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁盘存储、量子存储器、基于石墨烯的存储介质或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。Computer-readable storage media includes both persistent and non-permanent, removable and non-removable media, and storage of information can be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage , magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.
上面对本发明的具体实施方式进行了详细描述，虽然已表示和描述了一些实施例，但本领域技术人员应该理解，在不脱离由权利要求及其等同物限定其范围的本发明的原理和精神的情况下，可以对这些实施例进行修改和完善，这些修改和完善也应在本发明的保护范围内。The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principle and spirit of the present invention, whose scope is defined by the claims and their equivalents, and that such modifications and refinements shall also fall within the protection scope of the present invention.

Claims (13)

  1. 一种用于神经网络的数据批处理方法,其中,所述数据批处理方法包括:A data batch processing method for a neural network, wherein the data batch processing method comprises:
    获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Obtain memory bandwidth and select the original channel data of N continuous frame images according to the memory bandwidth;
    将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;The original channel data of the N continuous frame images are spliced to form a plurality of reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
    将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。A plurality of reconstituted data strips are sequentially input to the parallel computing unit array for convolution operation, wherein all the original channel data of the same reconstituted data strip enter the computing unit at the same time.
  2. 根据权利要求1所述的用于神经网络的数据批处理方法,其中,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。The data batch processing method for neural networks according to claim 1, wherein each piece of the restructured data strip further includes zero-padding data, and the data bit width of each piece of the restructured data strip is equal to the memory bandwidth.
  3. 根据权利要求2所述的用于神经网络的数据批处理方法,其中,所述数据批处理方法还包括:The data batch processing method for neural networks according to claim 2, wherein the data batch processing method further comprises:
    将所述多份重组数据条存储至内存中。The plurality of reconstituted data strips are stored in the memory.
  4. 根据权利要求1所述的用于神经网络的数据批处理方法,其中,将多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:The data batch processing method for neural networks according to claim 1, wherein the method for sequentially inputting multiple pieces of recombined data strips into a parallel computing unit array for convolution operation comprises:
    将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Multiply and add the original channel data of each continuous frame image in each of the reorganized data strips with the same weight data respectively;
    将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。The results of the multiplication and addition operations of the successive frame images in each of the recombined data strips are stored in different registers.
  5. 根据权利要求1所述的用于神经网络的数据批处理方法，其中，所述内存带宽为128比特，N为5，每张所述连续帧图像的每个像素位置上的原始通道数据包括红色通道数据、绿色通道数据和蓝色通道数据。The data batch processing method for a neural network according to claim 1, wherein the memory bandwidth is 128 bits, N is 5, and the original channel data at each pixel position of each of the continuous frame images includes red channel data, green channel data and blue channel data.
  6. 一种用于神经网络的数据批处理装置,其中,所述数据批处理装置包括:A data batch processing device for a neural network, wherein the data batch processing device comprises:
    数据获取模块,用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
    数据重组模块,用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;A data reorganization module for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channels of the N continuous frame images at the same pixel position data;
    卷积计算模块，用于读取多份重组数据条并依序对多份重组数据条进行卷积运算，其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。The convolution calculation module is configured to read a plurality of reorganized data strips and perform the convolution operation on them in sequence, wherein all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  7. 根据权利要求6所述的用于神经网络的数据批处理装置,其中,所述数据批处理装置还包括内存,所述内存用于接收并存储所述数据重组模块形成的多份重组数据条。The data batch processing device for neural networks according to claim 6, wherein the data batch processing device further comprises a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
  8. 根据权利要求6所述的用于神经网络的数据批处理装置,其中,所述卷积计算模块包括:The data batch processing apparatus for neural networks according to claim 6, wherein the convolution calculation module comprises:
    乘加器单元,用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;A multiplier-adder unit, used for multiplying and adding the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
    存储单元,用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The storage unit is used for storing the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip.
  9. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现权利要求1所述的用于神经网络的数据批处理方法。A computer-readable storage medium, wherein the computer-readable storage medium stores a data batch processing program for a neural network, and the data batch processing program for a neural network implements the method of claim 1 when executed by a processor Data batching methods for neural networks.
  10. 根据权利要求9所述的计算机可读存储介质,其中,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。The computer-readable storage medium of claim 9, wherein each of the reconstituted data stripes further includes zero-padded data, and a data bit width of each of the reconstituted data strips is equal to the memory bandwidth.
  11. The computer-readable storage medium according to claim 10, wherein the data batch processing method further comprises:
    storing the plurality of reorganized data strips in a memory.
  12. The computer-readable storage medium according to claim 9, wherein sequentially inputting the plurality of reorganized data strips into the parallel computing unit array for convolution operation comprises:
    performing a multiply-accumulate operation on the original channel data of each consecutive frame image in each of the reorganized data strips with the same weight data respectively;
    storing the results of the multiply-accumulate operations of the consecutive frame images in each of the reorganized data strips into different registers.
  13. The computer-readable storage medium according to claim 9, wherein the memory bandwidth is 128 bits, N is 5, and the original channel data at each pixel position of each consecutive frame image comprises red channel data, green channel data and blue channel data.
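The figures in claim 13 can be checked with a short calculation, assuming 8-bit channel data (the bit depth is not stated in this claim): five consecutive frames times three channels times 8 bits leaves a small zero-padded remainder within one 128-bit memory word.

```python
# Hypothetical packing check for claim 13: five consecutive frames, three
# channels (R, G, B) at one pixel position, assumed 8 bits per channel,
# zero-padded up to the 128-bit memory bandwidth (claim 10).
N_FRAMES = 5          # N = 5 consecutive frame images
CHANNELS = 3          # red, green, blue channel data
BITS_PER_CHANNEL = 8  # assumption: 8-bit channel values
BANDWIDTH = 128       # memory bandwidth in bits

payload = N_FRAMES * CHANNELS * BITS_PER_CHANNEL  # 120 bits of channel data
padding = BANDWIDTH - payload                     # 8 bits of zero padding
assert payload == 120 and padding == 8
```

Under this assumption one memory word holds exactly one reorganized data strip, so the data bit width of each strip equals the memory bandwidth as claim 10 requires.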
PCT/CN2020/120177 2020-08-07 2020-10-10 Data batch processing method and batch processing apparatus thereof, and storage medium WO2022027818A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010791617.5 2020-08-07
CN202010791617.5A CN114065905A (en) 2020-08-07 2020-08-07 Data batch processing method and batch processing device thereof, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
WO2022027818A1 true WO2022027818A1 (en) 2022-02-10

Family

ID=80119905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120177 WO2022027818A1 (en) 2020-08-07 2020-10-10 Data batch processing method and batch processing apparatus thereof, and storage medium

Country Status (2)

Country Link
CN (1) CN114065905A (en)
WO (1) WO2022027818A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388537A * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 Convolutional neural network accelerator and method
US20190147299A1 (en) * 2016-10-31 2019-05-16 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus for convolutional neural network
CN110136066A * 2019-05-23 2019-08-16 北京百度网讯科技有限公司 Video-oriented super-resolution method, device, equipment and storage medium
CN110211205A (en) * 2019-06-14 2019-09-06 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN110782393A (en) * 2019-10-10 2020-02-11 江南大学 Image resolution compression and reconstruction method based on reversible network
CN110895801A (en) * 2019-11-15 2020-03-20 北京金山云网络技术有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876813B (en) * 2017-11-01 2021-01-26 北京旷视科技有限公司 Image processing method, device and equipment for detecting object in video
CN110009102B (en) * 2019-04-12 2023-03-24 南京吉相传感成像技术研究院有限公司 Depth residual error network acceleration method based on photoelectric computing array
CN111199273B (en) * 2019-12-31 2024-03-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN111459856B (en) * 2020-03-20 2022-02-18 中国科学院计算技术研究所 Data transmission device and transmission method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147299A1 (en) * 2016-10-31 2019-05-16 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus for convolutional neural network
CN108388537A * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 Convolutional neural network accelerator and method
CN110136066A * 2019-05-23 2019-08-16 北京百度网讯科技有限公司 Video-oriented super-resolution method, device, equipment and storage medium
CN110211205A (en) * 2019-06-14 2019-09-06 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN110782393A (en) * 2019-10-10 2020-02-11 江南大学 Image resolution compression and reconstruction method based on reversible network
CN110895801A (en) * 2019-11-15 2020-03-20 北京金山云网络技术有限公司 Image processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114065905A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
US11775430B1 (en) Memory access for multiple circuit components
US10936937B2 (en) Convolution operation device and convolution operation method
US10545559B2 (en) Data processing system and method
WO2020062284A1 (en) Convolutional neural network-based image processing method and device, and unmanned aerial vehicle
WO2022110386A1 (en) Data processing method and artificial intelligence processor
CN113792621B (en) FPGA-based target detection accelerator design method
WO2020233709A1 (en) Model compression method, and device
WO2022007265A1 (en) Dilated convolution acceleration calculation method and apparatus
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN113301221B (en) Image processing method of depth network camera and terminal
Cadenas et al. Parallel pipelined array architectures for real-time histogram computation in consumer devices
WO2022027818A1 (en) Data batch processing method and batch processing apparatus thereof, and storage medium
CN107085827B (en) Super-resolution image restoration method based on hardware platform
CN109416743A Three-dimensional convolution device for recognizing human actions
WO2020029181A1 (en) Three-dimensional convolutional neural network-based computation device and related product
US6771271B2 (en) Apparatus and method of processing image data
CN116051345A (en) Image data processing method, device, computer equipment and readable storage medium
CN113160321B (en) Geometric mapping method and device for real-time image sequence
CN112001492B (en) Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
CN115456858B (en) Image processing method, device, computer equipment and computer readable storage medium
CN113222831B (en) Feature memory forgetting unit, network and system for removing image stripe noise
WO2022000456A1 (en) Image processing method and apparatus, integrated circuit, and device
CN109509218A (en) The method, apparatus of disparity map is obtained based on FPGA
CN116681588A (en) Super-resolution implementation method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20948605

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20948605

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 030723)
