CN114201726A - Convolution operation optimization method, system, terminal and storage medium

Info

Publication number: CN114201726A
Application number: CN202010986153.3A
Authority: CN
Inventors: 王峥; 廖健; 刘江佾
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Zhongke Yuanwuxin Technology Co ltd
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2022-03-18
Anticipated expiration: 2040-09-18
Also published as: WO2022057054A1; CN114201726B

Abstract

The application relates to a convolution operation optimization method, a convolution operation optimization system, a convolution operation optimization terminal and a convolution operation optimization storage medium. The method comprises the following steps: inputting the image data in the data memory module into the multithreading data cache module, and recording the data characteristics of the image data in each thread; when all threads are filled with image data, performing space-time similarity analysis on the data characteristics of at least two adjacent threads respectively, filtering out the image data of at least one of the at least two adjacent threads when the data characteristics of the at least two adjacent threads have space-time similarity, taking the threads with the filtered image data as idle threads to re-cache the image data input by the data memory module, and performing the space-time similarity analysis again when all threads are filled with the image data again; and performing convolution calculation according to the cached image data, and outputting new image data. The method and the device greatly reduce the actual convolution operation amount, improve the data reusability, reduce the overall network calculation time and improve the chip performance.

Description

Convolution operation optimization method, system, terminal and storage medium

Technical Field

The application belongs to the technical field of deep learning, and particularly relates to a convolution operation optimization method, a convolution operation optimization system, a convolution operation optimization terminal and a convolution operation optimization storage medium.

Background

In recent years, with the spread of big data applications and advances in computer hardware, deep learning has been used for feature extraction, classification and recursive operation on data, and is widely applied in computer vision, natural language processing, intelligent system decision-making and other fields. Convolution is a very important feature extraction method in deep learning. For example, current mainstream deep neural networks such as LeNet1, AlexNet, VGG-16 and VGG-19 are all built by stacking convolution layers, and classification accuracy improves as the number of network layers increases. However, because the computing power and speed of general-purpose computer platforms cannot keep pace with the large amount of computation that convolution itself consumes, a dedicated convolution calculation chip needs to be designed.

In the prior art, chip performance is improved by adding computing nodes, enlarging data caches, converting data types and the like. As parameter counts and computation volumes grow rapidly, ever higher data bandwidth and computing capacity are demanded of the hardware platform. However, existing architectures raise computing power by increasing the operating frequency and the number of computation and storage modules, and they already face low utilization of computation modules, high implementation cost, limited communication bandwidth, poor scalability and substantial energy waste.

Disclosure of Invention

The application provides a convolution operation optimization method, a convolution operation optimization system, a convolution operation optimization terminal and a convolution operation optimization storage medium, and aims to solve at least one of the technical problems in the prior art to a certain extent.

In order to solve the above problems, the present application provides the following technical solutions:

a convolution operation optimization method comprises the following steps:

inputting the image data in the data memory module into the multithreading data cache module, and recording the data characteristics of the image data in each thread;

when all threads in the multi-thread data cache module are filled with image data, respectively carrying out space-time similarity analysis on the data characteristics of at least two adjacent threads, and when the data characteristics of at least two adjacent threads have space-time similarity,

filtering out image data of at least one thread in the at least two adjacent threads, taking the thread after filtering the image data as an idle thread to re-cache the image data input by the data memory module, and re-performing the spatiotemporal similarity analysis when all threads are filled with the image data again until all threads in the multi-thread data cache module are filled with the image data and the data characteristics of the at least two adjacent threads do not have spatiotemporal similarity,

and performing convolution calculation according to the image data cached in the multithreading data caching module, and outputting new image data.

The technical scheme adopted by the embodiment of the application further comprises the following steps: the data features include pixel maxima, minima, and means of the image data.
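By way of illustration only, the recording of the pixel maximum, minimum and mean while image data is cached may be sketched as follows. This is a minimal software sketch and not the hardware implementation of the application; the class name `WindowFeatures` and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowFeatures:
    """Running pixel statistics for one thread's buffer (hypothetical name)."""
    count: int = 0
    total: int = 0
    maximum: Optional[int] = None
    minimum: Optional[int] = None

    def update(self, pixel: int) -> None:
        # Called once for every pixel written into the buffer, so the
        # features are complete as soon as caching of the window finishes.
        self.count += 1
        self.total += pixel
        self.maximum = pixel if self.maximum is None else max(self.maximum, pixel)
        self.minimum = pixel if self.minimum is None else min(self.minimum, pixel)

    @property
    def mean(self) -> float:
        # Only meaningful after at least one pixel has been cached.
        return self.total / self.count


features = WindowFeatures()
for pixel in [3, 9, 6]:
    features.update(pixel)
```

Updating the features in the same pass as the caching means that no second read of the buffer is needed before the similarity analysis.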

The technical scheme adopted by the embodiment of the application further comprises the following steps: the performing the spatiotemporal similarity analysis on the data characteristics of at least two adjacent threads in the multithread data cache module respectively comprises:

grouping all threads in the multithreading data caching module, wherein each group comprises at least two adjacent threads;

and respectively carrying out temporal and spatial similarity analysis on the data characteristics of the at least two adjacent threads in each group, and if the differences in the maximum value, the minimum value and the mean value between the data characteristics of the at least two adjacent threads are smaller than a set threshold value, judging that the image data in the at least two adjacent threads have space-time similarity.

The technical scheme adopted by the embodiment of the application further comprises the following steps: the performing spatiotemporal similarity analysis on the data characteristics of at least two adjacent threads in the multithreaded data caching module respectively further comprises:

and if the differences among the maximum value, the minimum value and the mean value of at least two adjacent threads in each group are smaller than a set similarity threshold, judging that the image data in the at least two adjacent threads are similar.
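By way of illustration only, the threshold test described above may be sketched as follows. The sketch assumes that a single threshold is applied to the absolute difference of each of the three features; the names `Features` and `is_similar` and the example values are hypothetical.

```python
from typing import NamedTuple


class Features(NamedTuple):
    """Data features of the image data cached in one thread."""
    maximum: float
    minimum: float
    mean: float


def is_similar(a: Features, b: Features, threshold: float) -> bool:
    # Two threads are judged similar only if every feature difference
    # (maximum, minimum and mean) is below the set threshold.
    return all(abs(x - y) < threshold for x, y in zip(a, b))


first = Features(maximum=200, minimum=10, mean=100)
near = Features(maximum=202, minimum=11, mean=101)
far = Features(maximum=210, minimum=10, mean=100)
```

With a threshold of 4, `near` would be judged similar to `first`, while `far` would not, because its maximum differs by 10.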

The technical scheme adopted by the embodiment of the application further comprises the following steps: the filtering out the image data of at least one thread of the at least two adjacent threads and continuing to cache the image data according to the idle thread after filtering the image data comprises:

reordering the idle threads after the image data are filtered;

and continuing to cache the image data from the reordered first thread.

The technical scheme adopted by the embodiment of the application further comprises the following steps: the image data input multithread data cache module further comprises:

the thread number of each thread in which image data is cached is recorded by an address register, and a thread number list recording the cache positions of the image data is generated.

The technical scheme adopted by the embodiment of the application further comprises the following steps: the filtering out image data of at least one of the at least two adjacent threads further comprises:

and deleting the thread number of the filtered data from the thread number list.

The technical scheme adopted by the embodiment of the application further comprises the following steps: the performing convolution calculation according to the image data cached in the multithread data caching module includes:

and outputting the new image data and the thread number list to the data memory module.

Another technical scheme adopted by the embodiment of the application is as follows: a convolution operation optimization system comprising:

a multithreading data caching module: used for caching image data and recording the data characteristics of the image data in each thread;

a data filtering module: when all threads in the multi-thread data cache module are filled with image data, performing spatiotemporal similarity analysis on data features of at least two adjacent threads respectively, filtering out image data of at least one thread in the at least two adjacent threads when the data features of the at least two adjacent threads have spatiotemporal similarity, taking the thread after filtering the image data as an idle thread to re-cache the image data, and performing the spatiotemporal similarity analysis again when all threads are filled with the image data again until all threads in the multi-thread data cache module are filled with the image data and the data features of the at least two adjacent threads do not have spatiotemporal similarity;

a convolution operation module: used for performing convolution calculation according to the image data cached in the multithread data cache module and outputting new image data.

The embodiment of the application adopts another technical scheme that: a storage medium storing program instructions executable by a processor to perform the method of optimizing convolution operations.

Compared with the prior art, the embodiment of the application has the advantages that: the convolution operation optimization method, the system, the terminal and the storage medium of the embodiment of the application utilize the similarity of input data in time and space, and screen and filter a part of similar data in a data caching stage, so that the actual convolution operation amount is greatly reduced, the data reusability is improved, the maximum utilization of thread resources is achieved, the overall network calculation time is reduced, and the chip performance is improved.

Drawings

FIG. 1 is a flow chart of a convolution operation optimization method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a convolution operation optimization system according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a storage medium according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Because a dedicated convolution computing chip must read image data from memory, about 80% of the chip's energy is consumed by data transfer. Improving data reusability by optimizing how data is cached can therefore greatly reduce the chip's power consumption and improve its performance. On this basis, the convolution operation optimization method in the embodiment of the application analyzes the intrinsic characteristics of the data and exploits the similarity of the input data in time and space to screen out and filter a part of the data in the data caching stage, so that the actual amount of convolution computation, which is the most time-consuming operation in a network, is greatly reduced, energy consumption is reduced, and the performance of the convolution calculation chip is improved.

Specifically, please refer to fig. 1, which is a flowchart illustrating a convolution operation optimization method according to an embodiment of the present application. The convolution operation optimization method comprises the following steps:

S1: inputting the image data in the data memory module into the multithreading data cache module, and simultaneously recording data characteristics of the image data in each thread, such as pixel maximum value, minimum value, mean value and the like;

S2: recording, through an address register, the thread number of each thread in which image data is cached, and generating a thread number list recording the cache positions of the image data;

S3: performing space-time similarity analysis and comparison on the data characteristics of at least two adjacent threads in the multithread data cache module, and judging whether the data characteristics of the adjacent threads are similar; if so, executing S4; otherwise, continuing to execute S3;

In this step, the spatio-temporal similarity analysis is specifically as follows, taking 64 threads as an example. In the time dimension, the 64 threads synchronously record the data characteristics of the data written into each buffer while the data is being cached, so that when caching is finished the data characteristics in each thread are completely updated. In the spatial dimension, the 64 threads are divided into 16 groups, each group comprising 4 adjacent threads, and each thread being one sliding window. Spatiotemporal similarity analysis is performed on the data characteristics of the 4 adjacent threads (namely, every 4 adjacent sliding windows) in each group: the data characteristics of the second, third and fourth threads in each group are each compared with the data characteristics of the first thread to judge whether they are similar to those of the first thread. The first thread and the threads that are not similar to the first thread in each group are marked as valid threads, the threads that are similar to the first thread are marked as invalid threads, and during subsequent thread calculation the calculation results of the invalid threads are replaced with the calculation result of the first thread.

Whether the data features are similar is judged as follows: if the maximum value, the minimum value and the mean value in the data characteristics of a given thread and of the first thread in its group are all close to a certain set threshold value, or if the differences between the maximum values, the minimum values and the mean values of the given thread and the first thread are smaller than a set similarity threshold value, the image data cached in that thread is considered similar to the image data of the first thread.

It can be understood that the thread grouping manner can be adjusted according to actual operation, or adjusted to be less or more sliding windows per group for spatiotemporal similarity comparison, or adjusted to be pairwise for every two adjacent sliding windows for spatiotemporal similarity comparison.
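By way of illustration only, the grouping and marking described for S3 may be sketched as follows. The sketch assumes that each thread's features are given as a `(max, min, mean)` tuple and that one threshold is applied to each feature difference; the function name `mark_threads`, the threshold of 4 and the example values are hypothetical. Eight threads are used instead of 64 to keep the example short.

```python
from typing import Dict, List, Sequence, Tuple

Features = Tuple[float, float, float]  # (max, min, mean) of one sliding window


def mark_threads(
    features: Sequence[Features], group_size: int = 4, threshold: float = 4.0
) -> Tuple[List[int], Dict[int, int]]:
    """Return the valid thread indices and, for each invalid thread,
    the index of the first thread of its group whose result stands in."""
    valid: List[int] = []
    substitute: Dict[int, int] = {}
    for start in range(0, len(features), group_size):
        reference = features[start]
        valid.append(start)  # the first thread of a group is always valid
        for idx in range(start + 1, min(start + group_size, len(features))):
            if all(abs(x - y) < threshold for x, y in zip(features[idx], reference)):
                substitute[idx] = start  # similar: marked invalid
            else:
                valid.append(idx)  # not similar: still needs convolution
    return valid, substitute


example = [
    (200, 10, 100), (201, 10, 100), (90, 5, 40), (200, 11, 101),  # group 0
    (50, 0, 20), (50, 0, 20), (51, 1, 21), (255, 0, 128),         # group 1
]
valid, substitute = mark_threads(example)
```

Changing `group_size` corresponds to the adjustment mentioned above, for example `group_size=2` for a pairwise comparison of every two adjacent sliding windows.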

S4: filtering out the image data in at least one of the at least two adjacent threads, updating the thread number list (i.e. deleting the number of the idle thread from which the data has been filtered), and re-executing S3;

S5: judging whether the first round of space-time similarity comparison of all threads in the multi-thread data cache module is finished; if so, executing S6; otherwise, continuing to execute S3;

S6: reordering idle threads after data filtering in the multithreading data cache module;

In this step, taking 64 threads as an example, the idle threads left after data filtering are renumbered starting from 65.

S7: continuing caching the image data from the first sorted thread, and re-executing S2-S6;

In this step, taking 64 threads as an example, after the first round of similarity comparison is finished, the image data continues to be cached from the 65th sliding window, and the thread number of each thread in which image data is cached continues to be recorded in the thread number list through the address register.

S8: judging whether all threads in the multithread data cache module are full, if not, continuing to execute S7; otherwise, go to S9;

S9: the convolution calculation module performs convolution calculation according to the image data in the multithread data cache module, and outputs the new image data and the thread number list to the data memory module.
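By way of illustration only, the fill, filter and refill loop of S1 to S8 may be simulated in software as follows. This is a simplified sketch under stated assumptions and not the hardware implementation: each sliding window is represented only by its `(max, min, mean)` features, the sliding-window number (1, 2, 3, ...) stands in for the thread number, one threshold is applied to each feature difference, and the names `fill_cache` and `similar` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Features = Tuple[float, float, float]  # (max, min, mean) of one sliding window


def similar(a: Features, b: Features, threshold: float) -> bool:
    return all(abs(x - y) < threshold for x, y in zip(a, b))


def fill_cache(
    windows: Iterable[Features],
    num_threads: int = 64,
    group_size: int = 4,
    threshold: float = 4.0,
) -> Tuple[List[Tuple[int, Features]], Dict[int, int]]:
    """Return the windows left in the cache for convolution and a map from
    each filtered window number to the window whose result stands in."""
    stream = enumerate(windows, start=1)  # assumed numbering of windows
    slots: List[Optional[Tuple[int, Features]]] = [None] * num_threads
    reuse: Dict[int, int] = {}
    while True:
        # S1, S2, S7: cache new windows into every idle thread.
        for i in range(num_threads):
            if slots[i] is None:
                item = next(stream, None)
                if item is None:
                    break  # no more input data
                slots[i] = item
        # S3, S4: compare each thread with the first thread of its group.
        filtered = False
        for start in range(0, num_threads, group_size):
            reference = slots[start]
            if reference is None:
                continue
            for i in range(start + 1, min(start + group_size, num_threads)):
                current = slots[i]
                if current is not None and similar(current[1], reference[1], threshold):
                    reuse[current[0]] = reference[0]
                    slots[i] = None  # the thread becomes idle
                    filtered = True
        # S8: stop once a round filters nothing.
        if not filtered:
            break
    return [s for s in slots if s is not None], reuse


stream_of_windows = [
    (100, 0, 50), (101, 0, 50), (200, 10, 90),
    (10, 0, 5), (150, 20, 70), (100, 1, 51),
]
kept, reuse = fill_cache(stream_of_windows, num_threads=4, group_size=2)
```

In this small example the second window is filtered because it is similar to the first, and the fifth window is cached in the thread that became idle. The sixth window stays in the input because the cache is then full with no similar neighbours.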

Based on the above, the embodiment of the application performs spatiotemporal similarity analysis on the input data and applies it in the data cache module before the convolution calculation, so that the actual amount of convolution computation is greatly reduced, data reusability is improved, thread resources are utilized to the maximum, the overall computation time of the network is reduced, and chip performance is improved. Compared with other traditional convolutional neural network acceleration approaches, the method uses a simple algorithm, is practical, is applicable to various convolution types and acceleration algorithms, allows the similarity threshold to be adjusted according to the input data under different conditions, and accelerates the operation of the convolutional neural network on the premise of not losing accuracy. In addition, because the method starts from the characteristics of the input data, it suits a first convolutional layer whose input data is uniform and largely unchanging (such as background images or surveillance video frames), and it has great potential for cases in which the convolution kernels of a deep neural network have a large depth.

The feasibility and effectiveness of this scheme are demonstrated below by experiment. The experimental scheme is implemented in Verilog HDL, and its feasibility and running time are simulated and verified with the Modelsim simulation tool. Specifically, a configuration file is prepared for a specific neural network, and image data is written into memory. The experiment takes as an example 64 threads, a 28x28 input image with 64 channels, a 28x28 output image with 128 channels, and a 5x5 convolution kernel. The scheme is added to the data cache, and convolution starts once the 64 threads have been filled after several rounds of fetching. When the experiment is finished, the memory is inspected through Modelsim to record the result. The result shows that the number of data caching operations required to process one picture is reduced, the number of calculations after caching is reduced, and the time for processing a single picture is reduced. The experimental result therefore supports the scheme's advantages of a simple algorithm, low complexity and high efficiency.
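By way of illustration only, the effect of filtering sliding windows on the amount of computation in the example layer above can be estimated as follows. The sketch assumes one sliding window per output position and one multiply-accumulate per kernel weight, input channel and output channel; the function name `conv_macs` is hypothetical, and no filtering ratio is taken from the experiment.

```python
def conv_macs(out_h: int, out_w: int, in_ch: int, out_ch: int, k: int,
              filtered_windows: int = 0) -> int:
    """Multiply-accumulate count of one convolution layer, with the
    filtered sliding windows skipped (assumed cost model)."""
    windows = out_h * out_w
    macs_per_window = k * k * in_ch * out_ch
    return (windows - filtered_windows) * macs_per_window


# 28x28 output, 64 input channels, 128 output channels, 5x5 kernel.
full = conv_macs(28, 28, 64, 128, 5)
saved_per_window = full - conv_macs(28, 28, 64, 128, 5, filtered_windows=1)
print(full)              # 160563200
print(saved_per_window)  # 204800
```

Under these assumptions every filtered sliding window removes 204,800 multiply-accumulate operations from the 160,563,200 needed for the unfiltered layer.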

Please refer to fig. 2, which is a schematic structural diagram of a convolution operation optimization system according to an embodiment of the present application. The convolution operation optimization system 40 according to the embodiment of the present application includes:

the multithreaded data cache module 41: used for caching image data and recording the data characteristics of the image data in each thread;

the data filtering module 42: when all threads in the multi-thread data cache module are filled with image data, performing spatiotemporal similarity analysis on data features of at least two adjacent threads respectively, filtering out image data of at least one thread in the at least two adjacent threads when the data features of the at least two adjacent threads have spatiotemporal similarity, taking the thread after filtering the image data as an idle thread to re-cache the image data, and performing the spatiotemporal similarity analysis again when all threads are filled with the image data again until all threads in the multi-thread data cache module are filled with the image data and the data features of the at least two adjacent threads do not have spatiotemporal similarity;

the convolution operation module 43: used for performing convolution calculation according to the image data cached in the multithread data cache module and outputting new image data.

Please refer to FIG. 3, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of the embodiment of the present application stores a program file 61 capable of implementing all the methods described above. The program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices such as a computer, a server, a mobile phone, and a tablet.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A convolution operation optimization method is characterized by comprising the following steps:

2. The method of claim 1, wherein the data features comprise pixel maxima, minima, and means of the image data.

3. The method of claim 2, wherein the performing spatiotemporal similarity analysis on the data characteristics of at least two adjacent threads in the multithreaded data cache module respectively comprises:

4. The method of claim 2, wherein the performing spatiotemporal similarity analysis on the data characteristics of at least two adjacent threads in the multithreaded data cache module respectively further comprises:

5. The method of claim 1, wherein filtering out image data of at least one of the at least two adjacent threads and continuing to buffer the image data according to an idle thread after filtering the image data comprises:

reordering the idle threads after the image data are filtered;

and continuing to cache the image data from the reordered first thread.

6. The method of any of claims 1 to 5, wherein the inputting image data into a multithreaded data caching module further comprises:

7. The method of claim 6, wherein filtering out image data of at least one of the at least two adjacent threads further comprises:

8. The method of claim 7, wherein performing convolution calculations based on image data cached in the multithreaded data caching module comprises:

9. A system for optimizing convolution operations, comprising:

10. A storage medium having stored thereon program instructions executable by a processor to perform the method of optimizing convolution operations according to any one of claims 1 to 8.