CN114611683A - Convolutional neural network operation implementation method, device, equipment and storage medium - Google Patents


Publication number
CN114611683A
Authority
CN
China
Prior art keywords
image data
processed
preset
data
preset number
Prior art date
Legal status
Pending
Application number
CN202210141616.5A
Other languages
Chinese (zh)
Inventor
曹玉龙
张尧
孙康睿
陈标发
景博
Current Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Original Assignee
Advanced Institute of Information Technology AIIT of Peking University
Hangzhou Weiming Information Technology Co Ltd
Priority date: 2022-02-16
Filing date: 2022-02-16
Publication date: 2022-06-10
Application filed by Advanced Institute of Information Technology (AIIT) of Peking University and Hangzhou Weiming Information Technology Co Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0656 Data buffering arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

The application provides a method, a device, equipment, and a storage medium for implementing convolutional neural network operations. The method includes: reading image data to be processed into a first preset number of first buffers, respectively; reading weight data of the convolutional neural network into a second preset number of second buffers, respectively, the second preset number being smaller than the first preset number; and performing, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed, the third preset number being an integer multiple of the first preset number. The method fundamentally addresses the computation-latency problem of convolutional neural networks and greatly reduces processor power consumption and cost.

Description

Convolutional neural network operation implementation method, device, equipment and storage medium
Technical Field
The application belongs to the technical field of convolutional neural networks, and particularly relates to a method, a device, equipment, and a storage medium for implementing convolutional neural network operations.
Background
A Convolutional Neural Network (CNN) is a class of feed-forward neural networks with a deep structure that includes convolution calculations, and is one of the representative algorithms of deep learning. Because the calculation process of a convolutional neural network model is complex and the amount of data processed is large, most convolutional neural network models currently suffer from computation latency.
In the prior art, to increase the processing speed of a convolutional neural network, the computation is usually offloaded from the CPU and accelerated by heterogeneous processing. The heterogeneous processors mainly used at present are GPUs, FPGAs, ASICs, and the like. However, a GPU can only pipeline instructions, not data, and its power consumption is too high. An ASIC can only support specific convolutional neural network operations and has a long development cycle, which cannot keep pace with the rapid evolution of today's neural network algorithms.
Disclosure of Invention
The application provides a method, a device, equipment, and a storage medium for implementing convolutional neural network operations, which fundamentally address the computation-latency problem of convolutional neural networks and greatly reduce processor power consumption and cost.
An embodiment of a first aspect of the present application provides a convolutional neural network operation implementation method, the method comprising:
reading image data to be processed into a first preset number of first buffers, respectively;
reading weight data of the convolutional neural network into a second preset number of second buffers, respectively; the second preset number is smaller than the first preset number;
performing, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed; the third preset number is an integer multiple of the first preset number.
In some embodiments of the present application, performing, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed includes:
reading a corresponding amount of image data to be processed out of each first buffer based on the size of a preset convolution kernel;
performing, according to the weight data in the second buffers and the preset convolution kernel, a convolution operation with a parallelism of the third preset number on the read image data to be processed.
In some embodiments of the present application, performing, according to the weight data in the second buffers and the preset convolution kernel, a convolution operation with a parallelism of the third preset number on the read image data to be processed includes:
performing, according to the preset convolution kernel, multiply-accumulate calculations between the image data to be processed read out of each first buffer and the corresponding weight data to obtain a third preset number of first intermediate data;
performing accumulation operations on the first intermediate data according to the preset convolution kernel and a preset dependency relationship;
obtaining a convolution operation result of the image data to be processed based on the results of all the accumulation operations.
In some embodiments of the present application, each buffer stores a fourth preset number of rows; the fourth preset number is equal to the second preset number, and the product of the fourth preset number and the second preset number is equal to the first preset number.
Performing, according to the preset convolution kernel, multiply-accumulate calculations between the image data to be processed read out of each first buffer and the corresponding weight data to obtain a third preset number of first intermediate data includes:
multiplying each row of image data to be processed read out of each first buffer by the corresponding fifth preset number of rows of weight data, respectively; the fifth preset number is equal to the height of the preset convolution kernel.
In some embodiments of the present application, performing accumulation operations on the first intermediate data according to the preset convolution kernel and the preset dependency relationship includes:
overlapping the lowest row of the preset convolution kernel with a first target number of rows of the first intermediate data, and performing accumulation operations between the first target number of rows of the image data to be processed and each row of the preset convolution kernel, respectively;
translating the preset convolution kernel downward by one stride at a time, and performing accumulation operations between the last row of image data to be processed from the previous operation and each row of the preset convolution kernel, respectively, until all pixel points have been traversed.
In some embodiments of the present application, obtaining a convolution operation result of the image data to be processed based on the results of all the accumulation operations includes:
for each row of pixel points, determining a group of intermediate operation results corresponding to that row of image data to be processed and to several adjacent rows of image data to be processed, respectively;
selecting, according to a preset rule, one intermediate operation result from the intermediate operation results corresponding to each row of image data to be processed to form the convolution operation result of that row of pixel points.
In some embodiments of the present application, after a corresponding amount of image data to be processed is read out of each first buffer, the method further includes:
buffering the image data to be processed read out of each first buffer in sequence by register pipelining (register beats).
An embodiment of a second aspect of the present application provides a convolutional neural network operation implementing apparatus, comprising:
a first reading module configured to read the image data to be processed into a first preset number of first buffers, respectively;
a second reading module configured to read the weight data of the convolutional neural network into a second preset number of second buffers, respectively; the second preset number is smaller than the first preset number;
a convolution operation module configured to perform, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed; the third preset number is an integer multiple of the first preset number.
Embodiments of a third aspect of the present application provide an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the computer program to implement the method according to the first aspect.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executable by a processor to implement the method according to the first aspect.
The technical solution provided in the embodiments of the present application has at least the following technical effects or advantages:
According to the convolutional neural network operation implementation method, the image data to be processed and the weight data are stored in the first buffers and the second buffers, respectively. When a convolution operation is needed, the data can be read from the buffers, and a highly parallel convolutional neural network operation is performed on the buffered image data based on a high-parallelism unrolling scheme. This shortens the overall computation time of the convolutional neural network operation on the image data, fundamentally addresses the computation-latency problem of convolutional neural networks, and greatly reduces processor power consumption and cost.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings.
In the drawings:
FIG. 1 is a flow diagram of a conventional convolutional neural network operation;
FIG. 2 is a flow chart of a convolutional neural network operation implementation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a storage layout of image data to be processed;
FIG. 4 is a schematic diagram of a storage layout of weight data;
FIG. 5 is a schematic diagram of reading out image data to be processed and weight data;
FIG. 6 is an overall unrolling schematic diagram of the convolutional neural network operation proposed by an embodiment of the present application;
FIG. 7 is a partial unrolling schematic diagram of the convolutional neural network operation proposed by an embodiment of the present application;
FIG. 8 is a detailed unrolling diagram of the convolutional neural network operation proposed by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a convolutional neural network operation implementing apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a storage medium provided by an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
The basic convolutional neural network calculation flow is shown in FIG. 1 (taking a single-channel convolution as an example, with a 3 × 3 convolution kernel and a stride of 1). For image data to be processed, the convolution kernel usually starts at the upper-left corner of the image, translates one stride to the right at a time until all pixel points in the horizontal direction have been traversed, then returns to the leftmost position and translates one stride downward, repeating the horizontal traversal until all pixel points have been covered. For a convolution with a 3 × 3 kernel, a single convolution operation requires 9 multiplications, and the result for one pixel point is obtained by accumulating all the products. The computation load of the convolution operation is therefore large, and this implementation occupies considerable processor power and processing time.
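For reference, the conventional flow of FIG. 1 can be written out directly. The following is a minimal NumPy sketch (not taken from the patent; the function and variable names are illustrative) of a 3 × 3, stride-1 sliding-window convolution, showing the 9 multiplications and one accumulation per output pixel:

```python
import numpy as np

def naive_conv2d(img: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=img.dtype)
    for r in range(oh):                  # translate downward one stride per output row
        for c in range(ow):              # translate rightward one stride per output column
            # 9 multiplications plus one accumulation per output pixel (3x3 kernel)
            out[r, c] = np.sum(img[r * stride:r * stride + kh,
                                   c * stride:c * stride + kw] * kernel)
    return out

img = np.arange(36, dtype=np.int64).reshape(6, 6)
kernel = np.ones((3, 3), dtype=np.int64)
print(naive_conv2d(img, kernel).shape)   # (4, 4)
```

Every output pixel repeats this window-by-window work, which is exactly the per-pixel cost the high-parallelism scheme below amortizes.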
In view of the foregoing problems, this embodiment provides a method, a device, equipment, and a storage medium for implementing convolutional neural network operations. The implementing apparatus may be a server or a processor of a server, and may be implemented on an FPGA (Field Programmable Gate Array); the FPGA's unlimited reprogrammability and high processing speed make it well suited to the method provided by this embodiment. The method is based on a high-parallelism unrolling scheme and can perform highly parallel convolution operations, reducing the overall computation time of the convolutional neural network operation on image data; it thereby fundamentally addresses the computation-latency problem of convolutional neural networks and greatly reduces processor power consumption and cost.
As shown in FIG. 2, the convolutional neural network operation implementation method provided in this embodiment is based on a high-parallelism unrolling scheme and may include the following steps:
Step S1: read the image data to be processed into a first preset number of first buffers, respectively.
Step S2: read the weight data of the convolutional neural network into a second preset number of second buffers, respectively.
To support the high-parallelism unrolling scheme, this embodiment designs a storage layout for the image data to be processed and for the weight data (such as the weight parameters used in feature extraction) used during the convolutional neural network operation.
In this embodiment, for convenience of calculation, the image data to be processed and the weight data may be stored in buffers (BRAMs) according to the highest parallelism: the image data to be processed in a first preset number of first buffers (BRAMs), and the weight data in a second preset number of second buffers. The highest parallelism is usually an integer multiple of the first preset number; since the weight data is smaller than the image data to be processed, the second preset number can be set smaller than the first preset number. For example, the first preset number may be 512, 1024, or 2048, and the second preset number may be 16, 32, or 64.
In addition, when storing the image data to be processed and the weight data, each buffer (both the first and the second buffers) may store a fourth preset number of rows; the fourth preset number is equal to the second preset number, and the product of the fourth preset number and the second preset number is equal to the first preset number. The fourth preset number is typically a multiple of 8, such as 32 or 64.
Taking an unrolling scheme with 32 input channels, 32 rows per channel, and a 3 × 3 convolution kernel as an example, as shown in FIG. 3 (the data in the boxes are example image data), the image data to be processed is distributed over 1024 first BRAMs: first the 32 rows of channel 0, then the 32 rows of channel 1, and so on up to channel 31, for 1024 BRAMs in total. Data of each channel beyond 32 rows is stored in the subsequent space of the same BRAM.
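The layout just described determines which first BRAM holds which image row. Below is a small sketch of the implied indexing under the 32-channel, 32-rows-per-channel example; the helper function is a hypothetical reconstruction for illustration, not part of the patent:

```python
NUM_CHANNELS = 32            # input channels in the example unrolling
BLOCK_ROWS = 32              # rows per channel block (the "fourth preset number")

def image_bram_location(channel: int, row: int) -> tuple:
    """Map (channel, row) of the input image to (first-BRAM index, address)."""
    bram = channel * BLOCK_ROWS + row % BLOCK_ROWS   # 32 x 32 = 1024 first BRAMs
    address = row // BLOCK_ROWS                      # later 32-row blocks go deeper
    return bram, address

print(image_bram_location(0, 0))    # (0, 0): first row of channel 0
print(image_bram_location(1, 0))    # (32, 0): first row of channel 1
print(image_bram_location(0, 33))   # (1, 1): row 33 reuses BRAM 1, next slot
```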
It should be noted that the storage layouts of the image data to be processed and the weight data, and the specific values of the first and second preset numbers given above, are merely preferred examples of this embodiment, and this embodiment is not limited thereto.
Step S3: perform, based on the weight data, a convolutional neural network operation with a parallelism of the third preset number on the image data to be processed.
The third preset number is typically an integer multiple of the first preset number, the multiple usually being the product of the convolution kernel dimensions. For example, in this embodiment the third preset number may be 9 (3 × 3) times the first preset number.
When performing the convolutional neural network operation with a parallelism of the third preset number, a corresponding amount of image data to be processed may be read out of each first buffer based on the size of the preset convolution kernel.
Similar to the image data, the weight data is stored as shown in FIG. 4 (the data in the boxes are example weight data): it is distributed over 32 BRAMs, each BRAM storing the weight data of channels 0 through 31 in sequence, with weight data beyond 32 channels wrapping around to the beginning. For a preset 3 × 3 convolution kernel, this layout ensures that 32 × 32 values can be read out in one cycle, so that 32 × 32 × 3 values are available for parallel computation after 3 cycles; the read data can be buffered in registers for use in the calculation.
When the convolutional neural network operation is performed, data is read from each buffer. As shown in FIG. 5, 32 rows × 32 channels of data can be read each time, so 32 rows × 32 channels × 3 of data are available after three clock cycles, and the read data can be buffered in registers for use in the calculation. The weight data is read in the same way, and parallel computation can begin once the reads complete. The image data to be processed read out of each first buffer is buffered in sequence using register pipelining (register beats).
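The staged read can be modeled in a few lines. The sketch below assumes each clock cycle yields one 32 × 32 slice and simply collects three of them in a register-like list, mirroring the "beat" buffering described above; all shapes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical per-cycle BRAM output: one 32-row x 32-channel slice of pixels
cycle_reads = [rng.integers(0, 256, size=(32, 32)) for _ in range(3)]

staged = []                        # register pipeline holding earlier cycles
for word in cycle_reads:           # one slice arrives per clock cycle
    staged.append(word)            # "beat": latch the read into registers
working_set = np.stack(staged, axis=-1)
print(working_set.shape)           # (32, 32, 3) after three cycles
```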
After the data has been read out, convolution operations with a parallelism of the third preset number can be performed on the read image data to be processed, according to the weight data in the second buffers and the preset convolution kernel.
Specifically, when performing the convolution operation with a parallelism of the third preset number, this embodiment may perform multiply-accumulate calculations, according to the preset convolution kernel, between the image data to be processed read out of each first buffer and the corresponding weight data, to obtain a third preset number of first intermediate data.
In this embodiment, data may be read row by row, and each row of image data to be processed read out of each first buffer is multiplied by the corresponding fifth preset number of rows of weight data. The correspondence between image data and weight data is usually determined by the size of the preset convolution kernel; the fifth preset number is equal to the height of the preset convolution kernel. For example, each read row of image data can be multiplied and accumulated with each of the three rows of weight data, so that 32 × 3 multiplications are completed in a single cycle. Compared with the conventional implementation of the convolutional neural network operation, the parallelism is thus 32 × 3 times.
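A sketch of what one such cycle computes, under assumed shapes: one 3-pixel window per buffered row is multiplied against all three kernel rows at once, which corresponds to the 32 × 3 parallel multiplications mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
kernel = rng.integers(0, 4, size=(3, 3))      # preset 3x3 convolution kernel
windows = rng.integers(0, 256, size=(32, 3))  # one 3-pixel window per buffered row

# partials[n, k] = dot(window of row n, kernel row k); all 32 x 3 products
# are formed together here, as they would be within one FPGA cycle
partials = windows @ kernel.T                 # shape (32, 3)
print(partials.shape)
```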
Once the first intermediate data obtained by multiply-accumulating the image data and the weight data is available, accumulation operations can be performed on the first intermediate data according to the preset convolution kernel and a preset dependency relationship; the convolution operation result of the image data to be processed, that is, the value of the corresponding pixel point, is then obtained based on the results of all the accumulation operations.
To perform the accumulation operations on the first intermediate data according to the preset convolution kernel and the preset dependency relationship, the lowest row of the preset convolution kernel can be overlapped with a first target number of rows of the first intermediate data, and accumulation operations are performed between the first target number of rows of the image data to be processed and each row of the preset convolution kernel, respectively. The preset convolution kernel is then translated downward by one stride at a time, and accumulation operations are performed between the last row of image data to be processed from the previous operation and each row of the preset convolution kernel, respectively, until all pixel points have been traversed.
To obtain the convolution operation result of the image data to be processed based on the results of all the accumulation operations, a group of intermediate operation results corresponding to each row of image data to be processed and to several adjacent rows can be determined for each row of pixel points; then, according to a preset rule, one intermediate operation result is selected from those corresponding to each row of image data to be processed to form the convolution operation result of that row of pixel points.
The preset dependency relationship may be as shown in FIGS. 6 and 7. Take the top-to-bottom translation of the preset convolution kernel as an example (because of the high parallelism of this embodiment, a whole row of pixel points is convolved within the same convolution step, so the convolution kernel only moves vertically and never horizontally). The third row of the preset convolution kernel is overlapped with row (n-1) of the image data to be processed, and accumulation operations between this row and each of the three rows of the preset convolution kernel yield three intermediate operation results M(n-1,0), M(n-1,1), M(n-1,2). The preset convolution kernel is then translated downward by one stride, so that its second and third rows overlap rows (n-1) and n of the image, and accumulation operations between row n and the three kernel rows yield the intermediate operation results M(n,0), M(n,1), M(n,2). The preset convolution kernel continues downward by one more stride, so that its first, second, and third rows overlap rows (n-1), n, and (n+1) of the image, and accumulation operations between row (n+1) and the three kernel rows yield the intermediate operation results M(n+1,0), M(n+1,1), M(n+1,2).
Adding M(n-1,0), M(n,1), and M(n+1,2) gives the result of pixel point (n, m). That is, the preset rule is that the remaining intermediate operation results are partial results of other pixel points and are accumulated accordingly based on the positional correspondence.
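This dependency rule can be checked numerically. The sketch below (illustrative names; M is kept as a small dictionary rather than hardware registers) forms the three partial sums per image row and recombines M(n-1,0) + M(n,1) + M(n+1,2), asserting that the result matches a direct 3 × 3 window sum:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 10, size=(6, 5))       # small image, integer pixels
kernel = rng.integers(0, 4, size=(3, 3))     # preset 3x3 kernel

n, m = 2, 1                                  # an interior pixel point (n, m)
cols = slice(m, m + 3)                       # 3-wide window at column m

# partial sums per image row: M[r][k] = dot(kernel row k, window of row r)
M = {r: kernel @ img[r, cols] for r in (n - 1, n, n + 1)}
via_partials = M[n - 1][0] + M[n][1] + M[n + 1][2]

direct = np.sum(img[n - 1:n + 2, cols] * kernel)  # plain 3x3 window sum
assert via_partials == direct
print(via_partials, direct)
```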
In this embodiment, data is buffered using register beats to ensure that 32 × 9 multiplications are computed in each cycle. The specific calculation process is shown in FIG. 8: the image data to be processed is buffered in BRAM (buffer_img) and, after being read out, is buffered through two register beats. The first three pixel points of the first row are then multiplied and accumulated with each row of the preset convolution kernel to obtain the three intermediate operation results m00, m01, and m02. The first three pixel points of the second row are multiplied and accumulated with each row of the preset convolution kernel to obtain m10, m11, and m12; by analogy, the three intermediate operation results of the third row are m20, m21, and m22. According to the requirements of the convolution calculation, m00, m11, and m22 are accumulated to obtain the result of one pixel point, and the data is buffered in BRAM (buffer_img_tmp); the other intermediate operation results are components of other pixel points and are accumulated accordingly per the requirements of the convolution calculation.
This completes one pass over the 32 input channels. When the number of input channels is larger than 32, the data in buffer_img_tmp is accumulated with the current calculation results, and after all channels have been processed, the result is output to buffer_img for the next convolution calculation.
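A sketch of that channel-group accumulation follows; the loop structure and tile shape are assumptions, and only the buffer names buffer_img_tmp and buffer_img follow FIG. 8:

```python
import numpy as np

TOTAL_CHANNELS, GROUP = 96, 32        # hypothetical layer: three 32-channel passes
OUT_SHAPE = (32, 30)                  # hypothetical output tile size

rng = np.random.default_rng(2)
buffer_img_tmp = np.zeros(OUT_SHAPE)  # running partial sums across channel groups

for start in range(0, TOTAL_CHANNELS, GROUP):
    # stand-in for one 32-channel convolution pass over the tile
    partial = rng.standard_normal(OUT_SHAPE)
    buffer_img_tmp += partial         # accumulate with the cached partials

buffer_img = buffer_img_tmp           # final result, input to the next calculation
print(buffer_img.shape)
```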
In addition, each time the calculation of 32 rows of image data to be processed is completed, it is necessary to distinguish whether the current 32 rows are the head 32 rows, one of the middle 32-row blocks (there may be several), or the tail 32 rows of the whole image. If the current 32 rows are the head of the image, the convolution operation for row 32 depends on the intermediate operation result of row 33, so that intermediate result must be cached; in the next round of calculation it is accumulated to produce the convolution result of row 32. Similarly, the convolution operation for row 33 depends on the intermediate operation result of row 32, so the intermediate result of row 32 is cached and, in the next round, accumulated with the other intermediate results to produce the convolution result of row 33. If the current 32 rows are a middle block of the image, the convolution of its first row depends on intermediate data from row 32 of the previous round, and the convolution of its row 32 depends on intermediate data from row 1 of the next round; the corresponding checks are made and the data is cached, as shown in FIG. 8, and the final convolution results are obtained in the next round of calculation. When the current 32 rows are the tail of the image, the convolution of row 32 has no further dependencies and can be output directly.
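The cross-block dependency can also be checked numerically. In the sketch below the tile height is shrunk from 32 to 4 rows purely to keep the demo small; caching an incomplete partial sum at the tile boundary and completing it on the next pass is the mechanism described above:

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.integers(0, 10, size=(8, 3))       # two 4-row "tiles" stacked vertically
kernel = rng.integers(0, 4, size=(3, 3))     # preset 3x3 kernel

def row_partials(r):
    """M[r][k] = dot(kernel row k, the 3-pixel window of image row r)."""
    return kernel @ img[r, 0:3]

TILE = 4
boundary = TILE - 1                          # output row straddling the two tiles
# pass over tile 0: row (boundary + 1) is not available yet, so cache
# the incomplete accumulation for the boundary pixel
cached = row_partials(boundary - 1)[0] + row_partials(boundary)[1]
# pass over tile 1: its first row completes the cached pixel
pixel = cached + row_partials(boundary + 1)[2]

direct = np.sum(img[boundary - 1:boundary + 2, 0:3] * kernel)
assert pixel == direct
print(pixel, direct)
```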
According to the convolutional neural network operation implementation method provided by this embodiment, the image data to be processed and the weight data are stored in the first buffers and the second buffers, respectively. When a convolution operation is needed, the data can be read from the buffers, and a highly parallel convolutional neural network operation is performed on the buffered image data based on the high-parallelism unrolling scheme. This shortens the overall computation time of the convolutional neural network operation on the image data, fundamentally addresses the computation-latency problem of convolutional neural networks, and greatly reduces processor power consumption and cost.
Based on the same concept as the convolutional neural network operation implementation method above, this embodiment further provides a convolutional neural network operation implementing apparatus, which is used to implement the method of any of the above embodiments. As shown in FIG. 9, the apparatus includes:
a first reading module configured to read the image data to be processed into a first preset number of first buffers, respectively;
a second reading module configured to read the weight data of the convolutional neural network into a second preset number of second buffers, respectively; the second preset number is smaller than the first preset number;
a convolution operation module configured to perform, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed; the third preset number is an integer multiple of the first preset number.
The convolutional neural network operation implementing apparatus provided in this embodiment, being based on the same concept as the above method, can achieve at least the beneficial effects of the above method, which are not repeated here.
An embodiment of the present application further provides an electronic device for executing the convolutional neural network operation implementation method. Referring to FIG. 10, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in FIG. 10, the electronic device includes: a processor 800, a memory 801, a bus 802, and a communication interface 803; the processor 800, the communication interface 803, and the memory 801 are connected by the bus 802. The memory 801 stores a computer program executable on the processor 800; when the processor 800 runs the computer program, the convolutional neural network operation implementation method provided by any of the foregoing embodiments of the present application is executed.
The memory 801 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this apparatus and at least one other network element is realized through at least one communication interface 803 (wired or wireless), using the Internet, a wide area network, a local network, a metropolitan area network, or the like.
The bus 802 can be an ISA bus, a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, and a control bus. The memory 801 is used to store a program; after receiving an execution instruction, the processor 800 executes the program. The convolutional neural network operation implementation method disclosed in any of the embodiments of the present application may be applied to the processor 800 or implemented by the processor 800.
The processor 800 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the above method may be completed by integrated logic circuits in hardware or by software instructions in the processor 800. The processor 800 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, discrete gate or transistor logic, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may thus be implemented or executed. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 801, and the processor 800 reads the information in the memory 801 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the application and the operation implementation method of the convolutional neural network provided by the embodiment of the application have the same inventive concept and have the same beneficial effects as the method adopted, operated or implemented by the electronic device.
Referring to FIG. 11, the computer-readable storage medium is shown as an optical disc 30 storing a computer program (i.e., a program product); when the computer program is executed by a processor, it performs the convolutional neural network operation implementation method provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail here.
The computer-readable storage medium provided by the above embodiment of the present application is based on the same inventive concept as the convolutional neural network operation implementation method provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run, or implemented by the application program it stores.
It should be noted that:
in the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Moreover, those skilled in the art will appreciate that although some embodiments herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A convolutional neural network operation implementation method, the method comprising:
reading image data to be processed into a first preset number of first buffers, respectively;
reading weight data of the convolutional neural network into a second preset number of second buffers, respectively; the second preset number is smaller than the first preset number;
performing, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed; the third preset number is an integer multiple of the first preset number.
2. The method of claim 1, wherein performing, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed comprises:
reading a corresponding amount of image data to be processed out of each first buffer based on the size of a preset convolution kernel;
performing, according to the weight data in the second buffers and the preset convolution kernel, a convolution operation with a parallelism of the third preset number on the read image data to be processed.
3. The method of claim 2, wherein performing, according to the weight data in the second buffers and the preset convolution kernel, a convolution operation with a parallelism of the third preset number on the read image data to be processed comprises:
performing, according to the preset convolution kernel, multiply-accumulate calculations between the image data to be processed read out of each first buffer and the corresponding weight data to obtain a third preset number of first intermediate data;
performing accumulation operations on the first intermediate data according to the preset convolution kernel and a preset dependency relationship;
obtaining a convolution operation result of the image data to be processed based on the results of all the accumulation operations.
4. The method of claim 3, wherein each buffer stores a fourth preset number of rows, the fourth preset number is equal to the second preset number, and the product of the fourth preset number and the second preset number is equal to the first preset number;
wherein performing, according to the preset convolution kernel, multiply-accumulate calculations between the image data to be processed read out of each first buffer and the corresponding weight data to obtain a third preset number of first intermediate data comprises:
multiplying each row of image data to be processed read out of each first buffer by the corresponding fifth preset number of rows of weight data, respectively; the fifth preset number is equal to the height of the preset convolution kernel.
5. The method of claim 4, wherein performing accumulation operations on the first intermediate data according to the preset convolution kernel and the preset dependency relationship comprises:
overlapping the lowest row of the preset convolution kernel with a first target number of rows of the first intermediate data, and performing accumulation operations between the first target number of rows of the image data to be processed and each row of the preset convolution kernel, respectively;
translating the preset convolution kernel downward by one stride at a time, and performing accumulation operations between the last row of image data to be processed from the previous operation and each row of the preset convolution kernel, respectively, until all pixel points have been traversed.
6. The method of claim 5, wherein obtaining a convolution operation result of the image data to be processed based on the results of all the accumulation operations comprises:
for each row of pixel points, determining a group of intermediate operation results corresponding to that row of image data to be processed and to several adjacent rows of image data to be processed, respectively;
selecting, according to a preset rule, one intermediate operation result from the intermediate operation results corresponding to each row of image data to be processed to form the convolution operation result of that row of pixel points.
7. The method of any one of claims 2-6, wherein after reading a corresponding amount of image data to be processed out of each first buffer, the method further comprises:
buffering the image data to be processed read out of each first buffer in sequence by register pipelining (register beats).
8. An apparatus for implementing convolutional neural network operations, comprising:
a first reading module configured to read the image data to be processed into a first preset number of first buffers, respectively;
a second reading module configured to read the weight data of the convolutional neural network into a second preset number of second buffers, respectively; the second preset number is smaller than the first preset number;
a convolution operation module configured to perform, based on the weight data, a convolutional neural network operation with a parallelism of a third preset number on the image data to be processed; the third preset number is an integer multiple of the first preset number.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-7.

Priority Applications (1)

Application number: CN202210141616.5A
Priority date: 2022-02-16
Filing date: 2022-02-16
Title: Convolutional neural network operation implementation method, device, equipment and storage medium


Publications (1)

Publication number: CN114611683A
Publication date: 2022-06-10

Family

ID=81858872




Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination